Hadoop Data Analysis and Processing Technology

Data analysis is the core of big data processing. Traditional data analysis is aimed mainly at structured data, and the general process is as follows: first a database is used to store the structured data, then a data warehouse is built on top of it, and finally the corresponding cubes are constructed and online analytical processing (OLAP) is performed as needed. This process is very efficient for relatively small volumes of structured data. For big data, however, analysis technology faces three immediate problems: large data volume, diverse data formats, and the required analysis speed. Standard storage technology cannot cope with data at this scale, so a more suitable platform must be introduced for big data analysis. At present, the open-source Hadoop is a widely used data processing technology and the core technology for analyzing and processing big data.
Hadoop is a Java-based, distributed, data-intensive data processing and analysis software framework that allows users to develop distributed programs without knowing the underlying details of the distribution, taking full advantage of the cluster for high-speed computation and storage. Its basic working principle is to decompose large-scale data into small, easily accessible batches and distribute them to multiple servers for analysis. It consists mainly of two functional modules: the Hadoop Distributed File System (HDFS) and the data processing engine (MapReduce). At the bottom, HDFS stores the files across all storage nodes in the Hadoop cluster; above HDFS sits the MapReduce engine, which is composed of a JobTracker and TaskTrackers. Its architecture is shown in the figure:


Figure: Hadoop architecture
3.1 HDFS
HDFS is designed for clusters of commodity hardware. So-called commodity hardware is not low-end hardware; its failure rate is much lower than that of low-end hardware. Hadoop does not need to run on expensive, highly reliable hardware: even on a large cluster where node failure is quite probable, HDFS can keep running after a failure without any noticeable interruption to the user. This design reduces machine maintenance costs, especially when the user manages hundreds or even thousands of machines.
HDFS is designed around an efficient write-once, read-many access pattern. Each analysis typically involves the entire dataset, so HDFS is optimized for high data throughput at the cost of high latency; for low-latency data access, HBase is a better choice. HDFS uses a master/slave architecture: an HDFS cluster consists of one NameNode (master) and multiple DataNodes (slaves). The NameNode is a central server responsible for managing the HDFS namespace and maintaining the HDFS files and directories. This information is persisted to local disk in the form of a namespace image file and an edit log file. The NameNode also records which DataNodes hold each block of each file, but it does not permanently save block location information, because the DataNodes re-report their blocks when the system starts. The NameNode is also responsible for controlling access by external clients.
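For illustration, the sketch below asks the NameNode for the block locations of a file through the standard HDFS Java client; the cluster address and file path are hypothetical, and in practice the configuration would normally come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/sample.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // The NameNode answers this query from its in-memory block map.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```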
A DataNode is an HDFS worker node, typically one machine in the cluster, responsible for managing the storage attached to that node. DataNodes store and retrieve data blocks as requested by clients or by the NameNode, execute commands to create, delete, and replicate blocks, and periodically send the NameNode the list of blocks they store, so that the NameNode can track each DataNode and verify the block mappings and file system metadata accordingly.
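A minimal sketch of this client interaction, assuming a working Hadoop configuration on the classpath and a hypothetical file path: the client creates a file (the NameNode allocates blocks and the bytes are streamed to DataNodes) and then reads it back.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt"); // hypothetical path

            // Write: the NameNode allocates blocks, the client streams bytes to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations, data is fetched from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```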
3.2 MapReduce
MapReduce is a software framework for processing large datasets. Its core design idea is to divide the problem into chunks and to push the computation to the data rather than the data to the computation. The simplest MapReduce application consists of at least three parts: a map function, a reduce function, and a main function. The model is relatively simple: the user's raw data is split into blocks, the blocks are handed to different map tasks, which process them and output intermediate results, and the reduce function then reads the lists of intermediate data, sorts them, and outputs the final result.
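The classic word count job is a minimal example of this map/reduce/main structure, written against Hadoop's standard org.apache.hadoop.mapreduce API; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word and emit the total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Main: configure the job and submit it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitting the compiled jar with hadoop jar (for example, hadoop jar wordcount.jar WordCount <input> <output>) runs the map tasks over the input blocks and the reducer over the sorted intermediate results.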
3.3 Hadoop Advantages and Problems
Hadoop is a software framework for the distributed processing of large amounts of data in a reliable, efficient, and scalable manner. It is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and can redistribute work away from failed nodes. It is efficient because it works in parallel, speeding up processing through parallel computation. It is scalable because it can handle petabyte-scale data.
However, like other emerging technologies, Hadoop also faces problems that need to be addressed. (1) Hadoop currently lacks enterprise-class data protection: developers must manually set the HDFS data replication parameters, and relying on developers to choose replication parameters easily leads to wasted storage space. (2) Hadoop requires investment in a dedicated computing cluster, which usually results in isolated storage and compute resources and in storage or CPU utilization problems, and this storage can have compatibility and sharing problems with other programs.
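As a sketch of the manual replication tuning mentioned in (1): the replication factor can be set as a client-side default through the dfs.replication property or changed for an existing file with FileSystem.setReplication. The value and path below are examples only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client (example value only;
        // normally chosen by the developer in hdfs-site.xml).
        conf.setInt("dfs.replication", 2);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of an existing (hypothetical) file.
            Path path = new Path("/data/sample.txt");
            boolean changed = fs.setReplication(path, (short) 2);
            System.out.println("replication changed: " + changed);
        }
    }
}
```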
