Hadoop is widely used in industry thanks to its practicality and usability in big data processing. Since its introduction in 2007, Hadoop has also attracted widespread attention and research from academia. Within just a few years it became by far the most successful and most widely accepted mainstream technology and system platform for big data processing, and a de facto industry standard: industry has further developed and improved it in large numbers, and it has been broadly adopted across application sectors, especially the Internet industry. Because the early system was limited in performance and functionality, Hadoop has continued to evolve throughout its development, shipping dozens of versions since the first release in 2007.
However, because Hadoop was originally designed for high-throughput offline batch processing, its initial system architecture contained many inherent deficiencies. Although the open source community has worked continually to improve and refine the system, the Hadoop 1.X versions suffered from several widely criticized defects: the single master node was a bottleneck that could easily lead to system congestion and single points of failure; job execution latency was too high to meet the needs of real-time, low-latency data query, analysis, and processing; and the fixed two-stage Map and Reduce model offered little programming flexibility, making it difficult to support efficient iterative computation, stream computation, graph computation, and other computing and programming models.
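The rigidity of the fixed two-stage model is easiest to see in the canonical word-count example. The following is a minimal pure-Python simulation of the map-shuffle-reduce flow; the function names and structure are illustrative only and do not reflect Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(records):
    # Map stage: emit a (word, 1) pair for every word in every input line
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort stage: the framework groups all values by key
    # between the Map and Reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce stage: aggregate the grouped values (here, sum the counts)
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data flows"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 2, 'data': 2, 'ideas': 1, 'flows': 1}
```

Every computation must be forced into this single map-then-reduce shape; an iterative algorithm, for example, must be expressed as a chain of such jobs with intermediate results written to disk between them, which is exactly the inefficiency criticized above.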
To address these problems, the Hadoop open source community began designing a new system architecture after the Hadoop 0.20 release, and in October 2011 launched a beta version, Hadoop 0.23.0, based on the new architecture; this version evolved into Hadoop 2.0, the new-generation Hadoop system known as YARN. The YARN architecture separates resource management from job management on the master node, introducing a global ResourceManager and a per-job ApplicationMaster to reduce the master node's load, and it strengthens the high availability (HA) of the system through ZooKeeper-based failure recovery for the ResourceManager. YARN also introduced the concept of the resource container, which divides and encapsulates the system's computing resources into uniform resource units instead of distinguishing Map and Reduce computing resources as earlier versions did, thereby improving the utilization of computing resources. In addition, YARN accommodates a wide range of parallel computing models and frameworks beyond MapReduce, increasing the flexibility of parallel programming on Hadoop.
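The container resource model and ZooKeeper-backed ResourceManager HA described above surface directly in YARN's configuration. The following `yarn-site.xml` fragment is a sketch only: the property names follow Hadoop 2.x conventions, but the values and ZooKeeper hostnames are illustrative:

```xml
<configuration>
  <!-- Containers are allocated between these memory bounds,
       replacing the fixed Map/Reduce slot model -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <!-- Total memory a NodeManager offers to containers on its node -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <!-- ResourceManager high availability backed by ZooKeeper -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1:2181,zk2:2181,zk3:2181</value>
  </property>
</configuration>
```

Because any container can host a Map task, a Reduce task, or a task from an entirely different framework, the scheduler can pack work onto whatever capacity is free, which is the source of the utilization gains noted above.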
At the same time, outside the Hadoop open source community, researchers have continued to develop and roll out systems that support the big data computing models Hadoop handles poorly. Chief among them is Spark, developed at the University of California, Berkeley's AMPLab (Algorithms, Machines, and People Lab) and currently the most widely researched and applied, which supports batch processing, in-memory computation, stream computation, iterative computation, graph computation, and many other computing models. Nevertheless, Hadoop retains advantages that other systems cannot match: its strengths in large-scale distributed data storage and batch processing, and the scalability and ease of use of the system. Moreover, years of Hadoop development and deployment have produced a large base of prior investment and production application systems, along with a rich and complete ecosystem of tools and software built around Hadoop. Together with Hadoop's own evolution toward its next-generation architecture and its continuous improvement, this means that for a long time to come Hadoop will remain the mainstream technology and platform for big data processing, while various newer systems gradually integrate and coexist with it.
While the open source Hadoop system continues to evolve, a number of companies in the industry have developed commercial distributions built on the open source Hadoop system, concentrating their research and product development on performance optimization, availability and reliability, and functional enhancement. The best known is Cloudera, a company closely associated with Hadoop's original developers, whose commercial distribution CDH has been widely promoted across many industries in the United States. Since 2009, Intel has also researched and developed its own Hadoop distribution, IDH, which has been widely popularized and applied in many large-scale application industries in China. Chapter 7 of this book details the main techniques IDH uses for performance optimization and functional enhancement, along with its methods of use.