Hadoop is an open source distributed computing platform consisting of two parts: a MapReduce execution engine and a distributed file system. InfoQ previously published a look at Hadoop's speed written by Jeremy Zawodny. This time, InfoQ senior Java editor Scott Delap interviewed Hadoop project lead Doug Cutting. In this InfoQ interview, Cutting discusses how Hadoop is used at Yahoo, the challenges of Hadoop development, and the future direction of the Hadoop project.
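To make that two-part architecture concrete, below is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API: the Mapper and Reducer express the MapReduce half, while the input and output paths live in HDFS, the distributed file system half. The class names and paths are illustrative, and the exact helper methods for setting paths vary slightly between Hadoop versions.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative word-count job using the classic org.apache.hadoop.mapred API.
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Sum the counts emitted by the mappers for this word.
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    // Input and output paths refer to directories in HDFS.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```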
Scott Delap (SD): Is Hadoop already serving some of Yahoo's features as a production system? If not, what are the plans for moving Hadoop from an experimental project to a core infrastructure component?
Doug Cutting (DC): Yahoo regularly uses Hadoop in its search business to improve its products and services, for example in ranking features and targeted advertising. There are also cases where data is generated directly with Hadoop. The long-term goal is for Hadoop to provide world-class distributed computing tools, as well as the Web-scale services that support the next generation of businesses, such as analysis of search results.
SD: How big is the team at Yahoo in charge of the Hadoop project? Apart from Yahoo insiders, how many active code contributors are there?
DC: Yahoo has a dedicated team directly responsible for Hadoop development, while active contributors to Apache open source projects generally have day jobs elsewhere. Even so, some non-Yahoo contributors work on Hadoop monthly, weekly, or even daily.
SD: Taking a different approach than Google, Yahoo has committed to an open, extensible infrastructure. Although Google has published a number of technical papers, their benefit to the general public is not very obvious. Why do you think open source is the right direction?
DC: Open source projects run best when two conditions are met: first, everyone has a common understanding of what the project should do; second, there is a well-understood, documented solution. Because infrastructure software is used widely across many domains, open source infrastructure develops exceptionally well. Yahoo uses and supports infrastructure software such as FreeBSD, Linux, Apache, PHP, and MySQL. Allowing anyone to use Hadoop helps Yahoo improve the state of the art in building large distributed systems. The source code is only a small part of the puzzle; beyond it, an organization needs a very strong engineering team to solve major problems and put the solutions into practice. The ability to properly deploy and manage the infrastructure is also very important, and few companies currently have all the necessary resources. Software engineers are willing to work on open source projects because they can meet many like-minded people in a large community, learn shared skills, and apply them to other projects in the future. Such a community environment readily produces outstanding new engineers. Both Yahoo and the Hadoop community benefit from this collaboration: we learn what large-scale distributed computing requires, and we share our expertise in creating a solution that everyone can use and modify.
SD: Back to the technology itself: as Hadoop has developed in recent years, what factors have you found most affect its speed and stability? I noticed that sorting 500 records is now 20 times faster than a year ago. Is that the result of a huge improvement in one particular area, or of common optimizations across multiple parts?
DC: As the number of other companies and organizations using this kind of solution keeps growing, Yahoo has found that they face similar needs when building Web-scale service software. Yahoo decided to develop it as open source rather than as proprietary software, and hired me to lead the project. So far, Yahoo has contributed most of the code.
As for speed, it is the sum of the efforts of the past few years, refined through repeated trials. We get the system running smoothly on a server cluster of a given size, and then test what happens when it runs on a cluster twice that size. Our goal is for performance to scale linearly with cluster size. We keep learning from this process and then increase the cluster size again. Each time the cluster grows, more bugs, and more kinds of bugs, appear, so stability becomes a major concern.
Each time we do this, we learn what is achievable and contribute that experience to the public body of knowledge on open source grid computing. As the server cluster grows, new failure modes appear, and rare errors become common errors that need to be addressed. What we learn from this process shapes our next experiment.
SD: Hadoop became able to run on Amazon EC2 last year, which lets developers build their own server clusters quickly. What additional work is needed to manage such a cluster and its HDFS and MapReduce processing?
DC: Yahoo has a project called HOD (Hadoop On Demand) that allows MapReduce to run on ordinary, commodity machines. It is an open source project that is still under construction. Since running a large cluster is complex and resource-intensive, Amazon EC2 is a very good platform for the general public to get started with Hadoop.
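As a rough illustration of what working against such a remotely hosted cluster looks like from the client side, here is a small sketch using Hadoop's FileSystem API to copy data into HDFS and list it. The namenode hostname, port, and paths are hypothetical placeholders, and the fs.default.name key reflects the configuration style of Hadoop's 0.x releases.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative client: point the HDFS client at a remote (e.g. EC2-hosted) namenode.
public class HdfsBrowse {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The namenode address is an assumption; substitute your cluster's master host.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS, then list the target directory.
    fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                         new Path("/user/hadoop/input/input.txt"));
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop/input"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```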
SD: How does Hadoop's feature set objectively compare with the systems Google has published? Are new features emerging as the solution is optimized, from the program level down to the data level?
DC: Over the past decade, many large companies (including Yahoo) and a number of research institutions have been developing and studying large-scale distributed computing software. Interest in this work has grown recently as low-cost computing has reached the consumer market. Unlike Google, Yahoo has developed Hadoop as fully open source, allowing anyone to use and modify the software for free. Hadoop's goal extends beyond being a replica of any existing technology. We are committed to building Hadoop into a system that is useful to anyone. We have implemented most of what Google has published, plus many other things that are not mentioned in their papers. Yahoo will continue to play a leading role in this project because its goals match our needs and we understand the significance of sharing this technology with the world.
SD: The latest official release is 0.13.1. Will there be any significant new features in upcoming releases? What kind of work will be done for version 1.0?
DC: There will be as many as 218 changes in the 0.14.0 release. One of the biggest changes is a direct improvement to data integrity. This is an invisible change for users, but it matters for the future development of the system as a whole: at the scale of our data and clusters, memory and disk errors occur frequently. We have also added the ability to change file times, a C++ API for MapReduce, other additional features, and many bug fixes.
Hadoop 0.15.0 is also taking shape, with 88 changes planned. This release will add authentication and authorization to the file system, making access to information within a shared cluster more secure. We also plan to revise a large part of the MapReduce API. 0.15.0 will be a difficult release, because it requires users to change their applications, and we want to get those changes in place in one step. We also hope 0.15 will be the last major release before 1.0. After 1.0 we will be very conservative and will not make drastic changes suddenly. We care a great deal about backwards compatibility, which matters even more for 1.0: any code written for version 1.0 should continue to run on later 1.x releases. So we need to make sure our existing APIs can be extended easily in future versions, and we will try to get them into that shape in 0.15.
Yahoo's Doug Cutting on MapReduce and the Future of Hadoop