Cloudera has enlisted four major companies to help it combine two large open source projects, in an effort to better coordinate the Hadoop community's development work.
Cloudera, IBM, Intel, Databricks and MapR have formed a partnership to port Apache Hive onto Apache Spark, an effort announced at the Spark Summit in San Francisco this week. We learned last week that Cloudera would recommend combining Hive with Spark.
For those not familiar with the many project names in Hadoop, here is a short explanation: Spark is a general-purpose cluster computing system that was originally developed at the University of California, Berkeley, and runs on top of the Hadoop filesystem. It can be used as an alternative to Hadoop MapReduce for data processing, and is said to be up to 100 times faster than MapReduce when working in memory and about 10 times faster when working on disk.
Hive, meanwhile, is data warehouse software designed to query data stored in Hadoop using SQL.
The importance of the two projects is not in doubt: Spark is considered a potential successor to MapReduce, while Hive is the recognized choice for running SQL workloads on Hadoop.
By combining Hive with Spark, Cloudera hopes to consolidate and centralize the complex Hadoop ecosystem. The move will also reduce the importance of Cloudera's own Impala project.
In an interview, Justin Erickson, Cloudera's director of product management, said the company had decided to back Hive because it wanted to "bring together the technical strengths of the Spark and Hive communities and, ultimately, deliver faster batch processing to Hadoop users."
"Hive is the standard batch solution in Hadoop today," said Matt Brandwein, head of the company's product marketing. "We want to reduce the confusion in the community," he said. "People are beginning to realize that there are too many options in front of them, each suited to different tasks. Spark is the successor, and that needs to be emphasized and recognized."
The move will have an important impact on the entire Hadoop ecosystem, and naturally on Cloudera itself. In the past, Cloudera has been sceptical about the value of Hive. In a blog post published last year, the company's chief strategist, Mike Olson, wrote that "decades of experience have taught us that we must make our databases responsive, and that Hive built on MapReduce does not meet this need."
To address this inherent weakness of Hive, Cloudera developed its own software, Impala. But with the new partnership between Cloudera, MapR, Databricks and Intel, Cloudera's attitude towards Hive seems to be softening, and the company will use the technology as its main way of engaging with the Hadoop community. It will, of course, also continue to develop Impala as a source of revenue.
In this context, another related project deserves a mention: Shark, which likewise aims to run Hive on Spark. However, Cloudera argues that Shark has diverged significantly from mainstream Hive.
"The Shark approach is to replace many key components of Hive, including the query planning mechanism and other elements," Cloudera explains. "As a result, it becomes very difficult for Shark to maintain compatibility with Hive, because changes made to Hive cannot be ported transparently to Shark. In our own Hive on Spark integration work, we make only very limited adjustments to the physical query planning mechanism and add new features to Hive itself, so that changes reach Spark, MapReduce and Tez transparently. In this way, the maintenance burden will be much lower than for the Shark project, and the work can be integrated more deeply with the core Hive community."
Cloudera's move undoubtedly puts pressure on Hortonworks, which has been developing Tez, a competing data-processing framework. However, Cloudera says that Spark, like Tez, is simply one of several options.
As the company says in its FAQ documentation, "The Spark project is not designed to replace Tez or MapReduce as a backend. Having a variety of backends is certainly a good thing for the Hive project. Users are free to choose between Tez, Spark and MapReduce. Each has its own strengths and is better suited to particular use cases. And Hive's success does not depend entirely on the success of Tez or Spark."
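In Hive, the choice of backend described in the FAQ is exposed through the `hive.execution.engine` property, which accepts `mr`, `tez` or `spark` in releases that include the corresponding backend. The following is a minimal sketch of how a client might build that setting. The helper names `engine_statement` and `with_engine` and the sample table `page_views` are hypothetical, introduced only for illustration, and the code builds strings without connecting to any cluster.

```python
# Illustrative sketch only: the helpers below are hypothetical and are not part
# of any Hive client library. `hive.execution.engine` is the Hive property
# that selects the execution backend for a session.

# Map the backend names used in the article to the property values.
ENGINES = {
    "mapreduce": "mr",
    "tez": "tez",
    "spark": "spark",
}


def engine_statement(backend: str) -> str:
    """Return the HiveQL SET statement that selects the given backend."""
    key = backend.strip().lower()
    if key not in ENGINES:
        raise ValueError(f"unknown backend: {backend!r}")
    return f"SET hive.execution.engine={ENGINES[key]};"


def with_engine(backend: str, query: str) -> str:
    """Prefix a query with the backend selection; the query itself is unchanged."""
    body = query.strip().rstrip(";")
    return engine_statement(backend) + "\n" + body + ";"


if __name__ == "__main__":
    # `page_views` is a made-up table name for the example.
    print(with_engine("spark", "SELECT COUNT(*) FROM page_views"))
```

The point of the sketch is that the query text stays the same whichever backend is chosen; only the session setting differs.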
Commenting on the matter, Hortonworks said that the decision to devote more development resources to running Hive on Spark is, on the whole, a good thing. "This is tantamount to admitting that the open source, community-driven model is the right choice," Shaun Connolly, Hortonworks' vice president of corporate strategy, said in an interview.
Looking at the collaboration from another angle, Cloudera has raised $900 million, $740 million of it from Intel, and it is clear that the company wants to lean on the big technology giants to take a leading role in the Hadoop business.
By putting some of that money into various open source projects related to Hadoop, Cloudera will be better placed to understand the future direction of such software and to work out how to profit from the growing base of business users.