Massive data processing

Source: Internet
Author: User
Tags: cassandra

Massive data processing concerns the storage, processing, and manipulation of very large amounts of information.

By "massive" we mean data at terabyte or even petabyte scale: too large to load into memory at once, or too large to process within an acceptable time. Faced with massive data, the simplest method that comes to mind is divide and conquer: split the large problem into small ones and handle each part separately. We can also think of cluster-based distributed processing, which extends the same idea across machines.
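As a toy illustration of the divide-and-conquer idea (not tied to any particular framework; the function names are illustrative), a stream too large to hold in memory can be cut into fixed-size chunks, each chunk solved independently, and the partial answers combined:

```python
def chunked(iterable, size):
    """Yield successive fixed-size chunks from any iterable,
    so memory use is bounded by the chunk size."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller, chunk

def total_sum(stream, chunk_size=1000):
    """Divide and conquer: solve each small piece independently,
    then combine the partial answers into the final result."""
    return sum(sum(c) for c in chunked(stream, chunk_size))
```

The same skeleton applies whenever the combining step is associative (sums, counts, maxima): each chunk can just as well be handled by a different machine in a cluster.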

1 Storage of massive data: preparing for big data analytics

Traditional relational databases
In data storage, traditional relational databases are mainly oriented to structured data. They focus on convenient data query and analysis, fast transaction processing according to strict rules, multi-user concurrent access, and guarantees of data security. Their structured data organization, strict consistency model, simple and convenient query language, strong data-analysis capability, and high independence between programs and data have made them widely used. However, relational databases built for structured data cannot meet the demands of fast Internet data access and large-scale data analysis and mining. Their main disadvantages:
1) They are not well suited to storing semi-structured and unstructured massive data, such as e-mail, hypertext, tags, pictures, audio, and video.
2) The relational model constrains the ability to access large amounts of data quickly. The relational model is accessed by content: in a traditional relational database, rows are located according to the values of columns. This access model introduces time-consuming input and output during data access, which limits fast access. Although partitioning techniques (horizontal and vertical) can reduce the amount of I/O during a query, decreasing response time and improving processing capacity, the improvement from partitioning is not significant at massive scale.
3) At massive scale, traditional databases have a fatal weakness: poor scalability.
Non-centralized data storage management systems
1) Amazon's Dynamo: Dynamo is Amazon's key-value storage platform, with good availability, scalability, and performance: 99.9% of read/write requests are answered within 300 ms. In Dynamo, data is organized as key-value pairs, primarily for the storage of raw data. Under this architecture, every node in the system can perceive the others; the system has strong self-management capability and no single point of failure.
2) Google's Bigtable: Bigtable is a structured storage system developed by Google. Data is stored in a multidimensional sorted table. The system adopts the traditional server-group form, consisting of one master server and several tablet servers, and uses the distributed lock service Chubby for fault-tolerant management. This architecture separates storage (which relies on GFS) from service management, simplifying management, easing maintenance, and keeping the system under human control. But because the underlying storage relies on a distributed file system, Bigtable can only be deployed in a cluster. HBase in Hadoop is an open-source implementation of Google's Bigtable.
3) Facebook's Cassandra: Cassandra was originally developed by Facebook and later became an open-source project. It is a structured data storage system implemented with peer-to-peer (P2P) technology. It combines the fully distributed architecture of Amazon's Dynamo with the column-family data model of Google's Bigtable, with peer-to-peer storage at its core; in many respects it can be called "Dynamo 2.0". Unlike Dynamo, Cassandra organizes data using a multidimensional table data model similar to Bigtable's. Its feature set is richer than Dynamo's, though not as rich as that of the document store MongoDB (an open-source product positioned between relational and non-relational databases; among non-relational databases it has the most abundant features and is the most like a relational database, with very loosely structured data in a JSON-like BSON format that can store more complex data types). Main features:
Distributed
Column-family-based structure
Highly scalable

2 Processing of massive data

Processing massive data means quickly extracting critical information from these massive amounts and providing it to the user.

Parallel computing solutions: One way to handle large-scale data processing is parallel computing. By spreading large amounts of data across multiple nodes, the computation is parallelized and the computing resources of multiple machines are used to speed up data processing. Current parallel computing models fall into three main categories: the MPI technology widely used in high-performance computing; the Map/Reduce computing model, represented by Google and Yahoo, that rose with Internet-scale data storage and processing; and the Dryad parallel computing model proposed by Microsoft.

1) MPI: MPI (Message Passing Interface) is a programming interface standard rather than a specific programming language. It is an industry-standard API specification designed for high-performance computing on multiprocessor computers, computer clusters, and supercomputers. The standard was designed by a large group of computer vendors and software developers in 1994.
As one of the most popular parallel programming environments, MPI is widely used in cluster high-performance computing because of its portability, ease of use, and complete asynchronous communication functionality. In the MPI programming model, a computational task consists of one or more processes that call library functions to exchange messages. Most MPI implementations generate a fixed set of communicating processes during program initialization. These processes run on different nodes (typically one process per processor), execute the same or different programs, interact through point-to-point or collective communication, and work together to accomplish the same computational task.
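A toy analogue of this message-passing pattern can be written with Python threads and queues rather than a real MPI library such as mpi4py (an assumption made here purely for portability): work is scattered to workers, each computes a partial result independently, and the results are gathered back.

```python
import queue
import threading

def worker(inbox, outbox):
    """Each 'process' receives messages, computes independently,
    and sends its partial result back: the MPI send/recv pattern."""
    partial = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: no more work
            break
        partial += sum(msg)
    outbox.put(partial)

def parallel_sum(data, nworkers=4):
    inboxes = [queue.Queue() for _ in range(nworkers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, results))
               for q in inboxes]
    for t in threads:
        t.start()
    # Scatter fixed-size chunks of work round-robin (like MPI_Scatter).
    for i in range(0, len(data), 10):
        inboxes[(i // 10) % nworkers].put(data[i:i + 10])
    for q in inboxes:
        q.put(None)              # tell every worker it is done
    for t in threads:
        t.join()
    # Gather and combine the partial results (like MPI_Reduce).
    return sum(results.get() for _ in range(nworkers))
```

In real MPI the workers would be separate processes on separate nodes and the queues would be network messages, but the scatter/compute/gather structure is the same.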
MPI is driven by message passing between tasks. Its basic idea for large-scale data processing is to divide the task into computing parts that can be completed independently and to distribute the data needed by each part to the corresponding compute nodes. After the calculation is complete, the individual nodes send their results to a primary compute node for the final summary.

2) MapReduce: MapReduce is a parallel computing model that Google proposed in 2004 for large-scale data processing on large clusters. The concepts of Map and Reduce, as well as their main ideas, come from functional languages.
In a computational task, the calculation is abstracted and simplified into two phases: Map and Reduce. In the Map phase, the system calls a user-supplied map function to complete a mapping from one set of key-value pairs to a new set of key-value pairs. In the Reduce phase, a user-specified reduce function aggregates the results of all the completed map computations. Unlike MPI, Map/Reduce distributes the computation (map or reduce) to the node that stores the corresponding data, or to an adjacent node, allowing the computation to be completed locally or near the data storage node and minimizing the pressure of transmitting large amounts of data over the network. Detailed documentation: Google's three core technologies (II): Google MapReduce (Chinese version).

3) Dryad: Dryad is a data-parallel computing model presented by Microsoft in 2007, currently in use at Microsoft adCenter. Similar in spirit to MapReduce, Dryad also reduces network pressure by moving computing tasks to the appropriate data storage nodes or nearby nodes, allowing calculations to be done in place or nearby. In Dryad, each computational task is represented as a directed acyclic graph (DAG), and the task executes along the directions of the graph in accordance with the dependency relationships. Compared with two-phase MapReduce, a DAG can express richer computation types, and it supports passing results between subtasks through TCP pipes and shared-memory FIFOs (first-in, first-out queues), avoiding unnecessary disk input and output as much as possible and accelerating execution.

Processing massive amounts of data can also be considered from the perspective of data structures and algorithms:
    1. Bloom Filter
    2. Hash statistics and Mapping
    3. Bit-map
    4. Heap/quick/merge sort
    5. Double-layer bucket division
    6. Database index
    7. Inverted index
    8. External sorting
    9. Trie Tree
Each of these targets a specific scenario, such as finding, in a large data set (more than 10 million items), the K largest numbers or the K most frequent text entries.
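For instance, the hash-statistics and heap techniques from the list combine to solve that top-K-frequent problem without sorting the whole data set, in a few lines (a minimal sketch; the function name is illustrative):

```python
import heapq
from collections import Counter

def top_k_frequent(stream, k):
    """Hash statistics plus a size-k min-heap: count each distinct item,
    then keep only the k most frequent, using O(n log k) heap work
    instead of fully sorting all counts."""
    counts = Counter(stream)           # hash-statistics pass
    heap = []                          # min-heap of (count, item)
    for item, count in counts.items():
        heapq.heappush(heap, (count, item))
        if len(heap) > k:
            heapq.heappop(heap)        # drop the current least frequent
    return sorted(heap, reverse=True)  # most frequent first
```

Because the heap never holds more than k entries, this works even when the number of distinct items is far too large to sort, and the counting pass itself can be sharded with the divide-and-conquer or Map/Reduce schemes described above.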
