Key Technologies of Big Data


In the big data era, are traditional data processing methods still applicable?

Data Processing Requirements in Big Data Environments

In a big data environment, data sources are rich and data types diverse; the volume of data to be stored, analyzed, and mined is huge; and the requirements for data presentation are high. The efficiency and availability of data processing are therefore critical.

Disadvantages of Traditional Data Processing Methods

Traditional data collection relies on a single source, and the volume of data stored, managed, and analyzed is relatively small, so most of it can be handled by relational databases and parallel data warehouses. Where parallel computing is used to improve processing speed, traditional parallel database technology pursues high consistency and fault tolerance; according to the CAP theorem, a system that prioritizes consistency in this way struggles to also guarantee availability and scalability.

Traditional data processing is also processor-centric, whereas big data environments require data-centric models that reduce the overhead of moving data. For these reasons, traditional data processing methods cannot meet the needs of big data.

What are the processes of big data processing? What are the main tools for each link?

The basic big data processing pipeline does not differ greatly from the traditional data processing pipeline. The main difference is that big data involves processing large amounts of unstructured data, so MapReduce and similar methods can be used for parallel processing at each stage.

How Does Big Data Technology Speed Up Data Processing?

MapReduce: A Powerful Tool for Parallel Processing of Big Data

MapReduce can be used to speed up the processing of big data. MapReduce was designed to achieve parallel processing of big data on large numbers of inexpensive servers, relaxing strict data-consistency requirements. Its outstanding advantages are scalability and availability, and it is particularly suitable for mixed processing of massive structured, semi-structured, and unstructured data.

MapReduce decomposes traditional queries and data analysis into distributed tasks and assigns them to different processing nodes, which gives it strong parallel-processing capability. As a simplified programming model for parallel processing, MapReduce also lowers the barrier to developing parallel applications.

MapReduce is a software framework consisting of two phases, map and reduce. It splits massive data sets, distributes the subtasks, and summarizes the results, thereby completing the parallel processing of massive data.

The principle of MapReduce is divide-and-conquer data processing. Map means "decompose": massive data is split into several parts and distributed to multiple processors for parallel handling. Reduce means "merge": the results from each processor are summarized to obtain the final answer. For example, to count the number of each geometric shape in a data set, MapReduce would first assign the data to two nodes, have each node count its share in parallel, and then merge the two partial counts into the final result.
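The shape-counting example above can be sketched in a few lines of Python. This is a toy illustration of the map/reduce pattern, not a real MapReduce framework: the "nodes" are just two shards processed in sequence, where a real deployment would run them in parallel on separate machines.

```python
from collections import Counter

def map_phase(shard):
    """Each node emits a partial count for its shard of the data."""
    return Counter(shard)

def reduce_phase(partials):
    """Merge the partial counts from all nodes into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

shapes = ["circle", "square", "circle", "triangle", "square", "circle"]

# Split the data between two "nodes", as in the shape-counting example.
shards = [shapes[:3], shapes[3:]]
partials = [map_phase(s) for s in shards]   # would run in parallel in practice
result = reduce_phase(partials)

print(dict(result))  # {'circle': 3, 'square': 2, 'triangle': 1}
```

The same split/merge structure underlies classic MapReduce jobs such as word counting, with the framework handling data distribution, scheduling, and fault tolerance.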

MapReduce is well suited to data analysis, log analysis, business-intelligence analysis, customer marketing, large-scale indexing, and similar services, with very visible results. Using MapReduce technology for real-time analysis, one household-appliance company shortened its credit computation time from 33 hours to 8 seconds, while MkI shortened its genetic analysis time from several days to 20 minutes.

Here it is worth noting how MapReduce differs from MPI, the traditional distributed parallel-computing environment. MapReduce differs significantly from MPI in its design goals, usage patterns, and file-system support, which allows it to adapt better to the processing needs of big data environments.

What New Methods Does Big Data Technology Use in Data Collection?

System Log Collection

Many Internet companies have their own massive-data collection tools, mostly used for system log collection, such as Hadoop's Chukwa, Cloudera's Flume, and Facebook's Scribe. These tools all adopt a distributed architecture and can meet log collection and transmission requirements of hundreds of megabytes per second.
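The data flow behind such tools can be sketched as follows. This is a toy illustration of the agent/collector pattern, not the API of Flume or Scribe: several agents forward local log records over a shared channel to one collector. Real tools add batching, failover, and persistent channels.

```python
import queue
import threading

channel = queue.Queue()   # stands in for the transport channel
store = []                # stands in for HDFS or another central sink

def agent(name, records):
    """An agent on one server forwards its local log records."""
    for record in records:
        channel.put(f"{name}: {record}")

def collector(expected):
    """The central collector drains the channel into the store."""
    for _ in range(expected):
        store.append(channel.get())

agents = [
    threading.Thread(target=agent, args=("web-1", ["GET /", "GET /about"])),
    threading.Thread(target=agent, args=("web-2", ["POST /login"])),
]
sink = threading.Thread(target=collector, args=(3,))

for t in agents + [sink]:
    t.start()
for t in agents + [sink]:
    t.join()

print(len(store))  # 3 records collected from both agents
```

The distributed architecture matters because no single agent or collector has to handle the full ingest rate; throughput scales by adding agents and collector tiers.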

Network Data Collection: Collecting Unstructured Data

Network data collection means obtaining data from websites through web crawlers or open APIs. This method extracts unstructured data from web pages and stores it as unified local data files in a structured form. It supports the collection of images, audio, video, and other files or attachments, and attachments can be automatically associated with the body text.
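The extraction step, turning an unstructured page into structured records, can be sketched with Python's standard `html.parser`. This is a minimal illustration, not a full crawler: the page is an in-memory string, and we only extract hyperlinks as (URL, anchor text) pairs. A real crawler would also fetch pages over HTTP and handle scheduling, politeness, and deduplication.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []       # structured output: list of (url, text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<p>See <a href="/docs">the docs</a> and <a href="/faq">FAQ</a>.</p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # [('/docs', 'the docs'), ('/faq', 'FAQ')]
```

The resulting tuples could then be written to a local file or database, which is the "structured storage" half of network data collection.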

In addition to content on the network, network traffic itself can be collected using bandwidth-management technologies such as DPI (deep packet inspection) or DFI (deep flow inspection).

Other Data Collection Methods

Data with high confidentiality requirements, such as production and operations data or research data, can be collected through cooperation with the enterprises or research institutions concerned, using dedicated system interfaces and other related methods.

 

This article is excerpted from "Big Data-big value, big opportunity, big change (full color)"

Edited by Li Zhigang

Published by Electronic Industry Publishing House

 
