Spark Summit China 2014 will be held in Beijing on April 19, 2014. Apache Spark community members and business users from China and abroad will gather in Beijing for the first time. Spark contributors and front-line developers from AMPLab, Databricks, Intel, Taobao, NetEase, and others will share their Spark project experience and best practices in production environments.
MapR is a well-known Hadoop vendor that recently added the complete Spark stack to its Hadoop distribution. This is a wise move, and it also suggests that Spark is likely to become the data processing framework of the future.
MapR was also a pioneer in using Apache Spark, and on Tuesday it announced that it will integrate the Spark stack into its Hadoop distribution as part of a partnership with the Spark startup Databricks (founder and CEO: Ion Stoica). Spark makes big data workloads easier to process and easier to program.
Spark began as an in-memory processing framework developed at the University of California, Berkeley. It gained popularity from there, but its real rise began in September 2013, when Databricks officially launched. Cloudera subsequently integrated Spark into its Hadoop distribution as part of its own partnership with Databricks. At the same time, many projects and companies originally built around Hadoop plan either to support Spark or to move to Spark outright.
These include Cloudera's Oryx project, the startup Platfora, and even the Apache Mahout project, as well as the companies taking part in the Databricks certification process.
Spark is so prevalent now because it does what MapReduce does and also what MapReduce cannot. MapReduce, the traditional Hadoop data processing framework, is slow (it is batch-oriented) and cumbersome to program. Spark is fast and flexible, which makes it better suited to tasks such as machine learning, graph processing, and interactive querying, and it is easy to program. Spark is written in Scala, but it also supports Java, Python, and R.
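To make the MapReduce programming model more concrete, here is a minimal sketch of a word count split into explicit map, shuffle, and reduce stages. It is plain Python, not the Hadoop or Spark API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map stage: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle stage: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce stage: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in order, as a batch job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["Spark is fast", "spark is flexible"]))
```

In a real MapReduce job each stage is written and configured separately, and intermediate results typically go to disk between stages. In Spark the same job is usually expressed as one short chain of transformations such as `flatMap`, `map`, and `reduceByKey` on a dataset that can stay in memory.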
YARN is the resource management system that ships as part of Hadoop 2.0. It allows multiple processing frameworks to run in the same cluster, all with access to the Hadoop Distributed File System (HDFS) for storage. This is what makes it possible for Hadoop to support Spark.
The most interesting part of MapR's news is that MapR provides full support for the Spark stack, including the Shark SQL query engine (essentially a faster Apache Hive) and the MLlib machine learning library, whereas Cloudera does not support Shark. This is probably because Cloudera is still pushing its Impala SQL query engine, which MapReduce does not include. MapR has been leading development of the interactive SQL query project Apache Drill, and it has also added native support for HP Vertica as Drill arrives.
From MapR's perspective, integrating Spark in response to user demand strengthens its position in the industry (until now, MapR has received far less attention than its competitors Cloudera and Hortonworks). For example, MapR now has its own version of the HBase NoSQL datastore, which offers more complete data storage capabilities than the open source version included with other Hadoop distributions.
Technologies like Spark, along with anything else that can run on YARN, are making Hadoop an emerging force with the potential to disrupt existing vendors in the data industry. Apache Hadoop has always provided cheap, open source storage, but its ecosystem has now turned Hadoop into a platform that can do many things with that data. In the next few years, we will see more analytic applications, and even databases, that use Spark or similar technologies as their engines.