Processing and Analysis Applications for PB-Scale Distributed Big Data

Source: Internet
Author: User
For big data, serial processing can no longer meet requirements, so parallel computing is now the main approach. Existing parallel computing falls into two types:

Fine-grained parallel computation. Fine granularity here mainly refers to the instruction or process level. Because GPUs have stronger parallel processing capability than CPUs, some tasks are handed to the GPU for parallel processing, and some GPU manufacturers have introduced programming models that make this easier for programmers, such as Nvidia's CUDA.

Coarse-grained parallel computation. Coarse granularity here refers to the task level: work is distributed across different machines. The recently popular grid computing and distributed computing both operate at this coarse-grained level.


Because existing GPU programming models are still immature, developers must handle a large number of parallelism details themselves, which is a heavy burden, so these models are not yet widely used. Some newer distributed programming models, by contrast, are popular with developers for their simplicity and convenience, so this article discusses coarse-grained parallel computing.


Because big data is distributed across a cluster, processing and analysis must also be carried out in the cluster. Analyzing data spread over multiple machines incurs a large performance overhead: even over gigabit or 10-gigabit networks, both random and sequential reads of remote data can be several orders of magnitude slower than reads from local memory. However, modern high-speed LAN technology makes reading over the network much faster than reading from a hard disk. As a result, keeping data in the memory of other nodes is better than keeping it on a local hard disk, and the dataset can then be processed concurrently on multiple nodes.


Distributed processing of big data brings some problems. The first is the communication cost between nodes. Some operations, such as search, count, partial aggregation, and join, can be performed independently on each node, but the per-node results still need to be merged, so communication between nodes is unavoidable. Moreover, not every aggregation can be decomposed into sub-operations that run independently; computing the median of all the data is one example. Nevertheless, most important operations have distributed algorithms that reduce communication between nodes.
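The difference between decomposable and non-decomposable aggregates can be illustrated with a minimal Python sketch. The in-memory lists standing in for nodes and the helper names `local_summary` and `merge_summaries` are hypothetical and chosen only for illustration; they are not taken from any particular framework.

```python
from statistics import median

# Hypothetical data held by three nodes (plain lists stand in for real nodes).
nodes = [
    [1, 2, 3],
    [4, 5, 6, 7],
    [100],
]

def local_summary(values):
    """Runs independently on one node: returns a small (count, total) pair."""
    return len(values), sum(values)

def merge_summaries(summaries):
    """Runs on a coordinator: combines per-node pairs into global results."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return count, total, total / count

# Only one small tuple per node would need to cross the network.
summaries = [local_summary(values) for values in nodes]
count, total, mean = merge_summaries(summaries)

# The median does not decompose this way: the median of the per-node medians
# generally differs from the median of all the data taken together.
median_of_medians = median(median(values) for values in nodes)
true_median = median(x for values in nodes for x in values)

print(count, total, mean)              # 8 128 16.0
print(median_of_medians, true_median)  # 5.5 4.5
```

Count, sum, and mean are recovered exactly from the per-node summaries, while the naive per-node approach gives the wrong median. An exact distributed median needs either more communication or a dedicated algorithm.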


Load imbalance between nodes is another major problem. Ideally every node performs the same amount of computation; otherwise the most heavily loaded node determines the completion time of the whole task, which is then longer than in the balanced case. In the worst case all the work is concentrated on one machine and parallelism brings no advantage. How data is distributed across nodes affects load balance. Consider a dataset containing 10 years of observations from 1000 sensors, each collecting data every 15 seconds, so that a single sensor produces roughly 21 million observations in 10 years. Suppose the data is distributed over 10 nodes by sensor, in chronological order within each sensor, so that each node holds the observations of 100 sensors. An operation on the data of a single sensor then leaves most of the nodes idle. If the data is instead distributed by time, operations restricted to a time range cause load imbalance in the same way.
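The sensor example can be worked through in a short Python sketch. The range-partitioning rule (sensors 0 to 99 on node 0, 100 to 199 on node 1, and so on) and the function names are assumptions made for illustration; leap days are ignored in the arithmetic.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # leap days ignored

def observations_per_sensor(years=10, interval_seconds=15):
    """Number of readings one sensor produces over the given period."""
    return years * SECONDS_PER_YEAR // interval_seconds

def node_for_sensor(sensor_id, sensors_per_node=100):
    """Assumed range partitioning by sensor id."""
    return sensor_id // sensors_per_node

def node_loads(queried_sensors, num_nodes=10):
    """Count how many of the queried sensors land on each node."""
    loads = [0] * num_nodes
    for sensor_id in queried_sensors:
        loads[node_for_sensor(sensor_id)] += 1
    return loads

per_sensor = observations_per_sensor()
all_sensors = node_loads(range(1000))  # a query touching every sensor
one_sensor = node_loads([42])          # a query touching a single sensor

print(per_sensor)   # 21024000
print(all_sensors)  # [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
print(one_sensor)   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Since the busiest node determines the completion time, the single-sensor query runs no faster on 10 nodes than on one: nine nodes have nothing to do.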

Another problem with distributed systems is reliability. Just as an aircraft with four engines is more likely to experience an engine failure than an aircraft with two, a cluster with 10 nodes is more likely to experience a node failure than a single machine. This can be addressed by replicating data between nodes: replication can improve the efficiency of data analysis and also provides the redundancy needed to cope with node failure. Of course, the larger the dataset, the harder it is to manage and maintain its copies.
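How replication masks node failure can be sketched in a few lines of Python. The round-robin placement rule and the names `place_replicas` and `unavailable_blocks` are hypothetical simplifications for illustration; real systems typically use more elaborate placement policies.

```python
def place_replicas(num_blocks, num_nodes, replication_factor):
    """Assign each block to `replication_factor` distinct, consecutive nodes."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: {(block + i) % num_nodes for i in range(replication_factor)}
        for block in range(num_blocks)
    }

def unavailable_blocks(placement, failed_nodes):
    """Blocks for which every replica sits on a failed node."""
    failed = set(failed_nodes)
    return [block for block, holders in placement.items() if holders <= failed]

replicated = place_replicas(num_blocks=20, num_nodes=10, replication_factor=3)
unreplicated = place_replicas(num_blocks=20, num_nodes=10, replication_factor=1)

print(unavailable_blocks(unreplicated, [4]))      # [4, 14]
print(unavailable_blocks(replicated, [4]))        # []
print(unavailable_blocks(replicated, [4, 5, 6]))  # [4, 14]
```

With three copies per block, a single node failure loses nothing, and data becomes unavailable only when all holders of a block fail together. The cost is that three times as much data must be stored and kept consistent, which is the maintenance burden the paragraph above mentions.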


At present, applications of big data processing and analysis focus mainly on the underlying technology, predictive analysis, real-time analysis, business intelligence, and statistics. Meeting these demands is of great help to enterprises.


Storing petabytes of data is not difficult; storing it efficiently is. The first thing to consider is how to organize the data so that it better supports the software layered on top of it, without having to dump and rearrange the data. This avoids the delays caused by staging, extraction, integration, and similar steps when data needs to be converted.


Effective predictive analysis, especially in real time, is of great help to enterprise decision-making. For example, a supermarket can predict which item a customer is likely to buy next from a large history of purchase records, and print coupons targeted at that customer's interests at checkout. Likewise, a football club's management can recommend a better-suited monthly or season ticket based on a fan's ticket purchase history.


At present, traditional data analysis software such as SAS and SPSS is limited in processing big data by its computing capacity. Newer data analysis products such as IBM Netezza often require expensive licensing fees, so open source big data analysis tools such as Hadoop, MapReduce, and R are attracting more and more attention.


Open source software is free of charge, requires no expensive license fees, and is backed by a large open source community. However, keeping pace with market demand is crucial, and open source projects, after all, lack the strong commercial incentive that drives commercial software forward.