In the big data era, are traditional data processing methods still applicable?
Data processing requirements in a big data environment
In a big data environment, data sources are abundant and data types are diverse; the volume of data to be stored, analyzed, and mined is enormous; and the requirements for presenting the data are high. In addition, the efficiency and availability of data processing are critical.
Disadvantages of traditional data processing methods
Traditional data collection relies on a single source, and the volume of data stored, managed, and analyzed is relatively small, so most of it can be handled with relational databases and parallel data warehouses. Where parallel computing is used to improve processing speed, traditional parallel database technology pursues high consistency and fault tolerance; according to the CAP theorem, it is therefore difficult to also guarantee availability and scalability.
Traditional data processing methods are processor-centric, whereas big data environments require a data-centric model that reduces the overhead of moving data. Traditional data processing methods therefore cannot meet the needs of big data.
What are the stages of big data processing, and what are the main tools at each stage?
There is no fundamental difference between the basic big data processing flow and the traditional data processing flow. The main difference is that big data must handle large volumes of unstructured data, so MapReduce and similar methods can be used for parallel processing at each stage.
Why does big data technology speed up data processing?
MapReduce: a powerful tool for parallel processing of big data
Big data can be processed with MapReduce to speed up data processing. MapReduce was designed to achieve parallel processing of big data on large numbers of inexpensive servers, and it does not demand strong data consistency. Its outstanding advantages are scalability and availability, which make it particularly suitable for mixed processing of massive structured, semi-structured, and unstructured data.
MapReduce distributes traditional query, decomposition, and data analysis operations across different processing nodes, which gives it much stronger parallel processing capability. As a simplified programming model for parallel processing, MapReduce also lowers the threshold for developing parallel applications.
MapReduce is a software framework consisting of two phases, Map and Reduce. It splits massive data sets, distributes the tasks, and then summarizes the results, thereby completing the parallel processing of massive data.
The principle of MapReduce is essentially to split the data and then process the pieces. Map means "decompose": the massive data set is divided into several parts and distributed to multiple processors for parallel processing. Reduce means "merge": the results from each processor are summarized to obtain the final result. For example, if MapReduce is used to count the number of different geometric shapes, the task is first allocated to two nodes, which compute their counts in parallel; their results are then merged to obtain the final count.
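To make the map/merge idea concrete, here is a minimal sketch in Python (not taken from the book): the shape list, the two-way split, and the function names are all hypothetical, and in a real MapReduce job the map tasks would run on separate cluster nodes rather than in a single process.

```python
from collections import Counter

# Hypothetical input: a stream of geometric shape labels to be counted.
shapes = ["circle", "square", "circle", "triangle", "square", "circle"]

def map_phase(chunk):
    """'Decompose': each node counts the shapes in its own chunk."""
    return Counter(chunk)

def reduce_phase(partial_counts):
    """'Merge': summarize the per-node counts into the final result."""
    total = Counter()
    for partial in partial_counts:
        total += partial
    return total

# Split the data into two parts, as if assigning them to two nodes.
chunks = [shapes[:3], shapes[3:]]
partials = [map_phase(c) for c in chunks]  # would run in parallel on a cluster
print(reduce_phase(partials))              # Counter({'circle': 3, 'square': 2, 'triangle': 1})
```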
MapReduce is well suited to workloads such as data analysis, log analysis, business intelligence, customer marketing, and large-scale indexing, with very noticeable results. Using MapReduce for real-time analysis, one household appliance company shortened its credit computation time from 33 hours to 8 seconds, while MkI shortened its genetic analysis time from several days to about 20 minutes.
Here it is worth looking at the difference between MapReduce and MPI, the traditional distributed parallel computing environment. MapReduce differs markedly from MPI in its design goals, usage patterns, and file system support, which allows it to better meet the processing needs of a big data environment.
What new methods does big data technology use in data collection?
System log collection method
Many Internet companies have built their own tools for collecting massive data, mostly for system log collection, such as Hadoop's Chukwa, Cloudera's Flume, and Facebook's Scribe. These tools all adopt a distributed architecture and can meet log collection and transmission requirements of hundreds of megabytes per second.
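To illustrate the basic pattern these tools share, here is a minimal conceptual sketch in Python (not the actual Chukwa, Flume, or Scribe API) of what a log collection agent does: tail a local log file and forward new lines to a central collector. The file path, host, and port are hypothetical.

```python
import socket
import time

LOG_PATH = "/var/log/app/access.log"          # hypothetical local log file
COLLECTOR = ("collector.example.com", 9000)   # hypothetical central collector

def follow(path):
    """Yield new lines appended to the log file, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

def ship_logs():
    """Forward each new log line to the central collector over TCP."""
    with socket.create_connection(COLLECTOR) as conn:
        for line in follow(LOG_PATH):
            conn.sendall(line.encode("utf-8"))

if __name__ == "__main__":
    ship_logs()
```

Real tools such as Flume layer buffering, batching, failover, and reliable delivery on top of this basic agent-to-collector pattern, which is what lets them sustain very high throughput.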
Network data collection method: collecting unstructured data
Network data collection means obtaining data from websites through web crawlers or open APIs. This approach extracts unstructured data from web pages, stores it as unified local data files, and organizes it in a structured way. It supports collecting files and attachments such as images, audio, and video, and attachments can be automatically associated with the page body.
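For illustration, here is a minimal crawler sketch using only the Python standard library. The target URL, the fields extracted (page title and links), and the output file are hypothetical, and a production crawler would add politeness rules, deduplication, and error handling.

```python
import csv
import urllib.request
from html.parser import HTMLParser

URL = "https://example.com/"   # hypothetical page to collect
OUTPUT = "links.csv"           # structured local data file

class LinkExtractor(HTMLParser):
    """Pull the page title and all hyperlinks out of unstructured HTML."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def crawl(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    # Store the extracted data in a structured form (one CSV row per link).
    with open(OUTPUT, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["page_title", "link"])
        for link in parser.links:
            writer.writerow([parser.title.strip(), link])

if __name__ == "__main__":
    crawl(URL)
```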
In addition to the content carried on the network, network traffic itself can be collected using bandwidth management technologies such as DPI or DFI.
Other data collection methods
Data with high confidentiality requirements, such as production and operating data or scientific research data, can be collected through cooperation with the relevant enterprises or research institutions, using dedicated system interfaces and similar methods.
This article is excerpted from "Big Data-big value, big opportunity, big change (full color)"
Edited by Li Zhigang
Published by Electronic Industry Publishing House