Key technologies for big data
In a big data environment, data sources are rich and data types are diverse; the volume of data to be stored, analyzed, and mined is large; the requirements for data presentation are high; and high efficiency and usability of processing are essential.
Shortcomings of traditional data processing methods
Traditional data acquisition draws on a single source, and the volume of data to be stored, managed, and analyzed is relatively small, so most of it can be handled with relational databases and parallel data warehouses. Traditional parallel database technology pursues high consistency and fault tolerance; since the CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance, it is difficult for such systems to also provide the availability and scalability needed to speed up parallel data processing.
Traditional data processing is also processor-centric, whereas a big data environment calls for a data-centric model that reduces the cost of moving data. Traditional data processing methods therefore cannot meet the demands of big data.
The process of big data processing
The basic process of big data processing does not differ greatly from traditional data processing. The main difference is that, because big data must handle large amounts of unstructured data, parallel processing techniques such as MapReduce can be applied at every stage.
Why can big data technology improve the speed of data processing?
MapReduce is big data's weapon for parallel processing: big data systems improve data processing speed by using MapReduce as their parallel processing technology. MapReduce is designed to achieve parallel processing of large data sets on large numbers of inexpensive servers, with low requirements on data consistency; its prominent advantages are scalability and availability, and it is especially suitable for mixed processing of massive structured, semi-structured, and unstructured data. MapReduce decomposes traditional queries and data analysis tasks and distributes them across different processing nodes, giving it stronger parallel processing capability. As a simplified programming model for parallel processing, MapReduce also lowers the threshold for developing parallel applications.
MapReduce is a software framework with two stages, Map (mapping) and Reduce (reduction). It can split massive data sets, decompose tasks, and aggregate results, thereby completing the parallel processing of massive data.
The working principle of MapReduce is essentially "divide first, then merge". Map is the "decomposition" step: the massive data set is split into several parts and distributed to multiple processors for parallel processing. Reduce is the "merging" step: the results from each processor are aggregated to obtain the final result. For example, if MapReduce is used to count the numbers of different geometric shapes in a collection, it first assigns the task to two nodes, each node counts its own portion, and the partial results are then aggregated into the final count.
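To make the divide-then-merge idea concrete, here is a minimal single-process Python sketch of the shape-counting example. The shape list, the split into two "nodes", and the function names are illustrative assumptions; in a real cluster, a framework such as Hadoop would run the map tasks on separate machines in parallel.

```python
from collections import Counter
from itertools import chain

# Hypothetical input: a collection of geometric shapes to count,
# mirroring the shape-counting example in the text.
shapes = ["circle", "square", "triangle", "circle", "square", "circle"]

def map_phase(chunk):
    """Map ("decomposition"): emit a (shape, 1) pair for each shape in this chunk."""
    return [(shape, 1) for shape in chunk]

def reduce_phase(pairs):
    """Reduce ("merging"): sum the counts for each shape across all chunks."""
    totals = Counter()
    for shape, count in pairs:
        totals[shape] += count
    return dict(totals)

# "Distribute" the data to two nodes by splitting it into two chunks.
mid = len(shapes) // 2
chunks = [shapes[:mid], shapes[mid:]]

# Each node runs the map phase independently (in parallel on a real cluster).
mapped = [map_phase(chunk) for chunk in chunks]

# The reduce phase aggregates the partial results into the final counts.
result = reduce_phase(chain.from_iterable(mapped))
print(result)  # {'circle': 3, 'square': 2, 'triangle': 1}
```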
MapReduce is well suited to data analysis, log analysis, business intelligence, customer marketing, large-scale indexing, and similar workloads, with very noticeable results. By applying MapReduce for real-time analysis, one appliance company shortened its credit calculation time from 33 hours to 8 seconds, while MkI shortened its genetic analysis time from several days to 20 minutes.
Here, let us look at the difference between MapReduce and MPI, the traditional distributed parallel computing environment. MapReduce differs greatly from MPI in its design purpose, usage, and file system support, which enables it to adapt better to the processing needs of big data environments.
What new methods does big data technology use for data acquisition?
1) System log collection method
Many Internet companies have their own tools for massive data acquisition, many of which are used for system log collection, such as Hadoop Chukwa, Cloudera Flume, and Facebook Scribe. These tools all use a distributed architecture and can meet log data collection and transmission requirements of hundreds of MB per second.
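For illustration only, a minimal Cloudera Flume (1.x) agent configuration of the kind these tools use might look like the sketch below. The agent name, log file path, channel capacity, and HDFS address are all assumptions made for the example, not values from the source.

```properties
# Hypothetical Flume 1.x agent: tail an application log and ship it to HDFS.
agent1.sources = logsource
agent1.channels = memchannel
agent1.sinks = hdfssink

# Source: follow new lines appended to a local log file (path is illustrative).
agent1.sources.logsource.type = exec
agent1.sources.logsource.command = tail -F /var/log/app/app.log
agent1.sources.logsource.channels = memchannel

# Channel: buffer events in memory between source and sink.
agent1.channels.memchannel.type = memory
agent1.channels.memchannel.capacity = 10000

# Sink: write the collected events to HDFS (cluster address is illustrative).
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.hdfssink.channel = memchannel
```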
2) Network data acquisition method: collecting unstructured data
Network data acquisition refers to obtaining data from websites by means of web crawlers or the open APIs that sites provide. In this way, unstructured data can be extracted from web pages, saved as unified local data files, and stored in a structured form. It supports the collection of pictures, audio, video, and other files or attachments, and attachments can be automatically associated with the main text. Beyond the content carried on the network, the collection of network traffic itself can be handled with bandwidth management techniques such as DPI (deep packet inspection) or DFI (deep flow inspection).
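As a minimal sketch of crawler-based network data acquisition, the following Python example fetches one page, extracts the title, visible text, and image links, and saves the result as a structured local JSON file. The requests and BeautifulSoup libraries, the target URL, and the output filename are illustrative choices, not tools named by the source.

```python
import json

import requests
from bs4 import BeautifulSoup

def crawl_page(url):
    """Fetch one page and turn its unstructured content into a structured record."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Structured record: title, visible text, and linked attachments (images here).
    return {
        "url": url,
        "title": (soup.title.string or "") if soup.title else "",
        "text": soup.get_text(separator=" ", strip=True),
        "images": [img.get("src") for img in soup.find_all("img") if img.get("src")],
    }

if __name__ == "__main__":
    # The target URL is a placeholder; replace it with a site you may crawl.
    record = crawl_page("https://example.com")
    # Store the structured result as a unified local data file.
    with open("page.json", "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
```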
3) Other data acquisition methods
For data with high confidentiality requirements, such as enterprise production and management data or scientific research data, collection can be carried out in cooperation with the enterprise or research institution, using a specific system interface or other related methods.
Ref: http://blog.csdn.net/broadview2006/article/details/8124670