Key technologies of big data processing

Source: Internet
Author: User
Keywords: big data, processing, traditional

Is the traditional data processing method applicable in the big data age?

Data processing requirements in a big data environment

In a big data environment, data sources are rich and data types are diverse, the volume of data to be stored and analyzed is huge, the requirements for data presentation are high, and the efficiency and availability of processing are highly valued.

Shortcomings of traditional data processing methods

Traditional data collection draws on a single source, and the volume of data to be stored, managed and analyzed is relatively small, so most of it can be handled with relational databases and parallel data warehouses. Traditional parallel database technology pursues strong consistency and fault tolerance, and according to the CAP theorem it is therefore difficult for it to also guarantee availability and scalability.

Traditional data processing methods are processor-centric, whereas a big data environment needs a data-centric model that reduces the cost of moving data. Traditional data processing methods therefore cannot meet the needs of big data.

What steps are involved in big data processing? What are the main tools for each step?

The basic processing flow for big data does not differ much from the traditional data processing flow. The main difference is that big data involves a large amount of unstructured data, so parallel processing with MapReduce can be used at every processing step.

Why does big data technology improve data processing speed?

Big data parallel processing tool: MapReduce

Big data technology improves processing speed through MapReduce, a parallel processing technique. MapReduce is designed to process big data in parallel on a large number of low-cost servers. Its data consistency requirements are not high; its outstanding advantages are scalability and availability, which make it especially suitable for mixed processing of massive structured, semi-structured and unstructured data.

MapReduce distributes traditional query, decomposition and data analysis work by assigning processing tasks to different processing nodes, which gives it stronger parallel processing capacity. As a simplified programming model for parallel processing, MapReduce also lowers the barrier to developing parallel applications.

MapReduce is a software framework with two stages, map (mapping) and reduce (reduction). It splits large volumes of data, decomposes tasks and aggregates results, and thereby completes the parallel processing of massive data.
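The two stages can be sketched in a few lines of Python. This is a minimal single-process illustration, not the API of Hadoop or of any other MapReduce framework; the names map_reduce, level_mapper and sum_reducer, and the sample log lines, are assumptions made for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map stage, group the output by key, then run a reduce stage."""
    # Map stage: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Grouping step (often called shuffle): collect the values of each key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: aggregate the values of each key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: log lines whose first field is the severity level.
log_lines = [
    "ERROR disk full",
    "INFO job started",
    "ERROR timeout",
    "INFO job finished",
    "WARN slow response",
]


def level_mapper(line):
    # Emit (level, 1) for every log line.
    return [(line.split()[0], 1)]


def sum_reducer(key, values):
    # Add up the ones emitted for this level.
    return sum(values)


level_counts = map_reduce(log_lines, level_mapper, sum_reducer)
print(level_counts)
```

In a real cluster the map calls and the reduce calls would run on different nodes; here they run in one loop so that the data flow between the stages stays visible.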

The working principle of MapReduce is in fact divide and conquer. Map is the "divide" step: the massive data is split into several parts that are handed to multiple processors for parallel processing. Reduce is the "merge" step: it combines the results from each processor into the final result. As shown on the right, to count the different geometric shapes with MapReduce, the task is first assigned to two nodes, the two nodes count their parts in parallel, and their results are then merged to give the final result.
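The shape-counting example might look roughly like this in Python. It is only a sketch under stated assumptions: two worker threads stand in for the two nodes, the list of shapes is made up, and count_shapes and merge_counts are hypothetical helper names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_shapes(partition):
    """Map step: one node counts the shapes in its own partition."""
    return Counter(partition)


def merge_counts(partial_counts):
    """Reduce step: merge the per-node counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical input data.
shapes = ["circle", "square", "triangle", "circle", "square", "circle"]

# Divide the data into two partitions, one per node.
middle = len(shapes) // 2
partitions = [shapes[:middle], shapes[middle:]]

# Two worker threads stand in for the two nodes.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_shapes, partitions))

shape_totals = merge_counts(partial_counts)
print(dict(shape_totals))
```

Each worker only ever sees its own partition, so the counting needs no coordination until the final merge.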

MapReduce is suitable for data analysis, log analysis, business intelligence analysis, customer marketing, large-scale indexing and similar workloads, where its effect is very noticeable. By combining MapReduce with real-time analysis, one electrical appliance company shortened its credit calculation time from 33 hours to 8 seconds, while MkI shortened its gene analysis time from several days to 20 minutes.

How does MapReduce differ from a traditional distributed parallel computing environment? MapReduce differs considerably from MPI in its design goals, usage and file system support, which allows it to meet the processing demands of a big data environment.

What new methods are used in data acquisition?

System log collection method

Many Internet companies have their own data collection tools, most of them used for system log collection, such as Hadoop's Chukwa, Cloudera's Flume and Facebook's Scribe. These tools all use a distributed architecture and can meet log collection and transmission requirements of hundreds of MB per second.
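A pattern commonly found in log collectors of this kind is an agent on each machine that buffers log lines and forwards them in batches to a central collector. The Python sketch below illustrates only that general idea; it is not the API of Chukwa, Flume or Scribe, and the LogAgent class, its batch_size parameter and the sample log lines are all hypothetical.

```python
class LogAgent:
    """Buffers log lines and forwards them to a sink in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # callable that receives a list of lines
        self.batch_size = batch_size  # number of lines sent per batch
        self.buffer = []

    def collect(self, line):
        # Buffer the line and send a batch as soon as the buffer is full.
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, then start a new batch.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


# A plain list stands in for the central collector.
received_batches = []
agent = LogAgent(sink=received_batches.append, batch_size=3)

for i in range(7):
    agent.collect(f"host-a request {i} served")
agent.flush()  # send the lines still waiting in the buffer

print([len(batch) for batch in received_batches])
```

Sending lines in batches rather than one at a time reduces the number of network round trips, which matters at high log volumes.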

Network data acquisition method: collecting unstructured data

Network data acquisition means obtaining data from a website through a web crawler or the website's public API. This method extracts unstructured data from web pages and stores it in a structured way as unified local data files. It supports collecting pictures, audio, video and other files or attachments, and attachments can be automatically associated with the body text.

Besides the content carried on the network, network traffic itself can be collected using bandwidth management techniques such as DPI (deep packet inspection) or DFI (deep/dynamic flow inspection).
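The extraction step can be sketched with Python's standard library. The example assumes the page has already been downloaded, so the HTML is hard-coded; the PageExtractor class, the sample page and the list of attachment extensions are illustrative assumptions, not part of any particular crawler.

```python
import json
from html.parser import HTMLParser

# Assumed file extensions that mark a link as an attachment.
ATTACHMENT_EXTENSIONS = (".pdf", ".jpg", ".png", ".mp3", ".mp4")


class PageExtractor(HTMLParser):
    """Collects the page title and all link targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside the <title> element is kept.
        if self._in_title:
            self.title += data.strip()


# Hypothetical page content standing in for a crawled page.
page_html = """<html><head><title>Product list</title></head>
<body><a href="/items/1">Item 1</a><a href="/files/spec.pdf">Spec</a></body></html>"""

extractor = PageExtractor()
extractor.feed(page_html)

# Structured record: attachments are stored alongside the page they came from.
record = {
    "title": extractor.title,
    "links": extractor.links,
    "attachments": [
        link for link in extractor.links
        if link.lower().endswith(ATTACHMENT_EXTENSIONS)
    ],
}
print(json.dumps(record))
```

The resulting record can be written to a local file as one JSON line per page, which gives the unified, structured storage described above.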

Other data acquisition methods

Data with high confidentiality requirements, such as enterprise business data or scientific research data, can be collected in cooperation with the enterprises or research institutes concerned, using specific system interfaces and similar methods.
