From Spark to Core: On the Evolution of Hadoop's Big Data Users

In the eight years of Hadoop development, we have seen "waves of usage": generations of users adopting Hadoop at around the same time and in similar environments. Every user who adopts Hadoop for data processing faces similar challenges, and is either forced to work with others or left to work in isolation to get everything running. Below we look at these generations of users and how they differ.

Generation 0: Kindling

This is the beginning: building on Google's research papers from the early 2000s, a few believers laid the foundation for turning cheap storage and computing power into a commodity.

Doug Cutting is the godfather. Together with Mike Cafarella, he implemented an open source version of Google's file system and MapReduce as part of the Apache Nutch project. From that work the two derived Apache Hadoop, now ubiquitous and at the center of a thriving big data ecosystem. Surprisingly, no competitor or commercial entity saw the potential of the technology at the time and developed a product to rival it.

First generation: Early adopters

The creation of Hadoop quickly attracted early adopters, including Web 2.0 companies such as Yahoo!, Facebook, Powerset, and Rapleaf. Some of them were more interested in the NoSQL component of Hadoop, its database (also known as HBase). They all needed a platform that could cope with a user base that was already large and growing fast. They bet that what worked for Google would also meet their needs. Hadoop delivered, and it still does today.

More importantly, these companies had a strong engineering background, with more developers than a typical enterprise. Their technologists could use Hadoop in-house and build solutions on top of it. For engineers, the career path began to diverge: either dig into the code and eventually build a project within the Hadoop ecosystem, or fall into a role that combines development and cluster operations... This was the birth of an unwritten rule of Hadoop development: the people involved should have a broad range of skills and be able to carry the whole load. It worked because these lone warriors were talented enough to get the job done.

Both groups contributed to the Hadoop code base, and many of their members were therefore elected as Hadoop committers, who are allowed to review and commit code to the open source repository. We're talking about a team of about 200 people who drive the development of Hadoop worldwide.

Today, some of these engineers have moved on to other projects or other companies, but most of them are still active in Hadoop circles. Yahoo! is particularly noteworthy, since it pushed the development of Hadoop from the outset.

Second generation: Followers

Hadoop's early adopters impressed a new group of users, often employed by the booming Web 3.0 and social network companies. These users largely belong to the same era as Hadoop's originators, though they are somewhat younger. They usually don't have a deep Java background, but these "we write code fast" types, armed with Python, Ruby, or Scala, can crack any problem with the help of energy drinks and endless hours, as long as it isn't Java. So they built great web sites, such as Last.fm and Spotify, and quickly put together the things Hadoop lacked, such as Dumbo, a Python MapReduce bridge (Last.fm), or Luigi, a new job scheduling system (Spotify).

Filling in missing Hadoop components through modular, independent development happened not only at young startups, but also at other companies that were reluctant to get involved in the increasingly politicized Hadoop core developer community. LinkedIn is one example: it developed a number of tools around Hadoop's core services and also built secondary systems for event collection, queue processing, and so on. LinkedIn open sourced these projects, helping interested users build new communities around them.

Third generation: Late bloomers

The next generation of users to take an interest in all things Hadoop were the so-called enterprise companies. They range from small to large and are pure IT users: they buy the hardware and software licenses they need, and their architects knead them into solutions, products, or services. But they won't hire a team of core developers to patch or build Hadoop stacks. In fact, most of these users rely on a Hadoop distribution, such as Cloudera CDH, to get Hadoop up and running faster. This is much like working with a vendor-supplied operating system: you can focus on the business logic on top of Hadoop, and if you run into problems or missing components, you talk to the vendor and then upgrade to a new version.

Interestingly, these users were happy with the young Hadoop, even though it still lacked many enterprise features. Their Hadoop clusters were isolated from the rest of the network and managed by a handful of people, and usually one cluster ran only one application, so concerns about multiple users or multiple workloads could safely be postponed.

Fourth generation: New wave

We now see companies adopting Hadoop that waited a long time, holding off because Hadoop had so many shortcomings. But with Hadoop's arrival in enterprise-class data centers, these companies are ready to run it as well. They did not wait idly: they carefully studied Hadoop's features, spent time testing the various parts of the system, and know exactly what they want, namely a secure, multi-user, multi-workload data platform that integrates with existing IT systems and offers data management, security auditing, and integrated administration.

Another important development I would like to point out is that the Hadoop project itself, from the very beginning to the current wave of users, has been the product of many users working together over time. By now Hadoop bears hardly a trace of its initial members. It has become a commodity so widespread in the industry that no single person can represent it, and only Doug Cutting is still worth mentioning by name.

Now that more and more enterprise-class products are turning to Hadoop, Larry Ellison is unlikely to welcome the trend, because it does not help the sale of Oracle databases. Explaining big data to Hadoop users, as Doug was once asked to do, now happens only on a much smaller scale. In fact, a large number of users do not know the original group of people in this circle at all. Times move on.

Generation 1.1: Periodic fluctuations

The point where this circle seems to close can be described with one of my favorite terms: technical debt. Let's go back to the first generation, and even more so to the second. Both have accumulated a lot of legacy systems that need constant maintenance to keep up with the evolving Hadoop ecosystem. That maintenance ties up people who, in a real business, are needed elsewhere, for example to develop products and provide services that increase revenue. Ideas that seemed great when they were created have turned into a burden of constant spending. For first-generation companies Hadoop has become the norm, and they maintain modified versions of the code base to keep their Hadoop working. Second-generation companies now have a rich set of tools to choose from that they no longer need to develop themselves, yet the tools they built earlier still have to be maintained by them.

So I predict that, ultimately, these earlier generations of users will scrap their previous efforts and migrate to a Hadoop distribution. That will let them focus on their business, such as successfully developing data-driven products and services, while the Hadoop vendor they choose ensures that they always have the data infrastructure they need to do so. A great future lies ahead!
