In a world of real-time data, why are we still clinging to Hadoop?
As an architecture for batch processing, Hadoop remains the crown jewel of the big data technology world. Yet according to survey data from 451 Research, its actual adoption still lags well behind its prominent reputation.
Companies rushing to adopt Hadoop may want to slow down a bit. With the rise of Apache Spark and a range of other technologies (including Storm, Kafka, and more), we appear to be moving away from Hadoop-style batch processing and onto a path toward a real-time future.
Batch is not dead
Doug Cutting of Cloudera is a highly capable engineer and a prolific open source developer, having been involved in Hadoop, Lucene, and many other foundational big data projects.
While Cutting acknowledges the importance of real-time streaming technology, he makes no apologies for Hadoop's batch-oriented origins, as he pointed out in our email interview:
This is not to say that Hadoop's architecture shouldn't have been designed around batch processing, because batch processing really matters. In fact, batch processing, especially under MapReduce, was the ideal starting point because it is relatively easy to implement and delivers real practical value. Before Hadoop was born, there was simply no way to use open source software to store and process petabytes of data on commodity hardware. Hadoop's MapReduce helped engineers take an important first step toward that capability.
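Cutting's point that MapReduce was "relatively easy to implement" can be illustrated with the classic word-count example. The sketch below mimics the map, shuffle, and reduce phases in plain Python; it is an illustration of the programming model, not Hadoop's actual API.

```python
from collections import defaultdict

# Word count in the MapReduce style: a map phase emits (key, value)
# pairs, a shuffle groups values by key, and a reduce phase sums them.
# This mimics the model only; real Hadoop distributes each phase
# across many machines.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The appeal of the model is exactly what Cutting describes: a developer only writes the map and reduce functions, while the framework handles distribution, grouping, and fault tolerance.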
It is difficult to measure precisely how important the commoditization of big data has been to the world at large. Before Hadoop was born, we faced countless storage and analytics capacity challenges; Hadoop gave us that critical capability at an affordable cost.
In short, Hadoop was an essential precondition for the democratization of big data, putting it within reach of ordinary businesses.
A shift to stream processing?
But it is still very difficult to extract real benefits from big data. As Patrick McFadin, chief evangelist at DataStax, said in an interview, getting real value from corporate data is not as easy as many people preach:
We've all heard about the ROI of storing and analyzing petabyte-scale data. Google, Yahoo, and Facebook have indeed delivered on that return on investment, but many other companies still can't find a way to thoroughly analyze and use all of their data. Step one: collect all the data. Step two: ...... Step three: profit!
There is a series of steps between collecting data and turning a profit, and those steps are genuinely cumbersome to implement. As enterprises seek to improve their real-time analytics capabilities, new technologies are gradually turning that ambition into reality.
McFadin described the key elements of this new big data stack. First, he said, it should include a queuing system; the most typical examples are Kafka, RabbitMQ, and Kinesis. Next, the enterprise needs a stream processing layer, which might be Storm, Spark Streaming, or Samza. For storage, businesses typically choose Cassandra, HBase, MongoDB, or a relational database such as MySQL.
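The flow through those three layers can be sketched with in-memory stand-ins. The sketch below is purely illustrative: a `Queue` stands in for Kafka/RabbitMQ/Kinesis, a consumer loop for the stream processor, and a dict for the storage layer; the event shape is an assumption.

```python
from queue import Queue

def producer(events, q: Queue) -> None:
    """Queuing layer stand-in: ingest raw events."""
    for event in events:
        q.put(event)
    q.put(None)  # sentinel marking end of stream

def stream_processor(q: Queue, store: dict) -> None:
    """Stream layer stand-in: consume events as they arrive and
    maintain a running per-user aggregate in the storage layer."""
    while (event := q.get()) is not None:
        user = event["user"]
        store[user] = store.get(user, 0) + event["clicks"]

store = {}  # storage layer stand-in (Cassandra, HBase, MySQL, ...)
q = Queue()
producer([{"user": "a", "clicks": 2},
          {"user": "b", "clicks": 1},
          {"user": "a", "clicks": 3}], q)
stream_processor(q, store)
print(store)  # {'a': 5, 'b': 1}
```

In a real deployment each layer runs as its own distributed service, but the shape of the pipeline is the same: events enter a durable queue, a processor updates state as they arrive, and a store serves the results.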
Just as notable are the areas where batch processing still applies. According to McFadin, "the batch mechanism is still very practical" for workloads such as rollups and deep analysis. Merging the batch and real-time approaches yields the so-called Lambda architecture, which combines three cooperating layers: a batch layer, a speed layer, and a serving layer.
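The division of labor among the three Lambda layers can be sketched as follows, using a simple per-user event count; the function names and event shape are illustrative, not part of any framework.

```python
# Minimal sketch of the Lambda architecture's three layers for a
# per-user event count. Names and data shapes are illustrative.

def batch_layer(historical_events):
    """Periodically recompute a complete view from all history."""
    view = {}
    for e in historical_events:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

def speed_layer(recent_events):
    """Incrementally count only events not yet absorbed by batch."""
    view = {}
    for e in recent_events:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

def serving_layer(batch_view, realtime_view, user):
    """Answer queries by merging the two views."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

historical = [{"user": "a"}, {"user": "a"}, {"user": "b"}]
recent = [{"user": "a"}]
bv, rv = batch_layer(historical), speed_layer(recent)
print(serving_layer(bv, rv, "a"))  # 3
```

The batch layer trades freshness for completeness, the speed layer covers the gap since the last batch run, and the serving layer hides the split from the querying application.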
In other words, batch processing still has its own significance.
Throwing batch into the dustbin of history
But not everyone agrees with that view. Justin Langseth, CEO and co-founder of Zoomdata, for example, dismissed Lambda as an "unnecessary compromise," telling us in an interview that "there are now end-to-end tools for sourcing, transferring, storing, analyzing, and visualizing data" without involving a batch mechanism at all.
In his view, the batch mechanism was an unavoidable artifact of its era, a legacy of old-style big data:
Real-time data is clearly best handled as a stream. But companies can just as well incorporate historical data into streams, much as a DVR can stream "Gone with the Wind" or last week's "American Idol" to a viewer's TV. This distinction matters because we at Zoomdata believe that analyzing data as a stream brings considerable scalability and flexibility, whether the data is real-time or historical.
Beyond the scalability and flexibility benefits, removing the batch mechanism from the big data pipeline also brings significant simplification. As Langseth puts it, "This can greatly simplify big data architectures because users no longer have to worry about batch windows, recovery from batch-job failures, and other kinds of hassles."
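Langseth's DVR analogy amounts to running one processing path for both live and replayed historical data, so no separate batch code or batch window exists. The sketch below illustrates that idea with stdlib stand-ins; the event log and aggregate are illustrative assumptions.

```python
from typing import Iterable, Iterator

def replay(log: list) -> Iterator[dict]:
    """'DVR' replay: yield stored historical events as a stream."""
    yield from log

def process(stream: Iterable[dict]) -> dict:
    """Single processing path shared by live and replayed streams."""
    totals = {}
    for e in stream:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

history = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
live = iter([{"user": "a", "amount": 1}])

# Same code path, two sources: no batch window, no duplicated logic.
print(process(replay(history)))  # {'a': 10, 'b': 5}
print(process(live))             # {'a': 1}
```

This is the simplification Langseth is pointing at: because historical data is just a stream that happens to start in the past, there is only one pipeline to write, test, and recover.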
Why can't the two live in harmony?
They largely can, Cutting suggests, at least for the foreseeable future.
Rather than consigning older technologies like Hadoop to the scrap heap, Cutting prefers to see the world as a harmonious ecosystem in which streaming and batch coexist (one in which, naturally, Cloudera's enterprise data hub also deserves attention). In fact, he added, "I don't think there will be a wholesale shift toward streaming. Rather, streaming will join the collection of processing options, available for you to choose when the circumstances fit."
More interesting still, Cutting believes the "big bang" phase of big data innovation (and, frankly, of slow-moving corporate IT adoption) has begun to settle, and that the industry will converge on a handful of good approaches to the technical challenges at hand:
I think major new additions to the stack, like Spark, will become less frequent, so over time we will converge on a standardized set of tools that lets most users get the capabilities they need from their big data applications. The birth of Hadoop ignited a Cambrian explosion of big data projects, but we may soon enter a normal evolutionary cycle in which these technologies spread into more industries.
Scott Hirleman, community manager at DataStax, agrees: "The batch mechanism will not be completely discarded, because hyper-scale analysis over huge volumes of data isn't going anywhere." Stream analytics will draw more of the industry's attention, he admits, but he insists it is far too early to say how the trend will reshape big data planning.
In short, the main role of stream analytics is to complement batch processing, not retire it. It is an excellent addition to batch-based systems such as Hadoop, not a ticket sending the big data veteran off to the nursing home.
Not just Hadoop: the future path of big data technology