The idea of real-time business intelligence is no longer a novelty (a Wikipedia page on the concept appeared in 2006). Yet although people have discussed such systems for many years, I have found that many companies still have no clear plan for adopting them, or have not even recognized how much they stand to gain.
Why is that? One big reason is that the real-time business intelligence and analytics tools on the market have been very limited. Traditional data warehouse environments are built primarily around batch processes, which tend to be either extremely slow or extremely costly, or, of course, both.
Yet a number of powerful, easy-to-use open-source platforms are starting to change this situation. Two of the most notable are Apache Storm and Apache Spark, both of which offer real-time processing capabilities to a broad range of potential users. Both are projects of the Apache Software Foundation, and while their functionality overlaps in part, each has its own distinctive features and role.
Storm: The Hadoop of real-time processing
Storm is a distributed computing framework dedicated to event stream processing. It began as a project at BackType, a marketing intelligence firm that Twitter acquired in 2011. Twitter open-sourced the project and published it on GitHub; Storm later entered the Apache Incubator and formally became an Apache top-level project in September 2014.
Storm is sometimes called the Hadoop of real-time processing. The Storm project's documentation appears to agree: "Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing."
To achieve this, Storm is designed for massive scalability and provides fault tolerance through a "fail fast, auto restart" approach to its processes, with a strong guarantee that every tuple will be processed. Storm defaults to an "at least once" guarantee for messages, but users can implement "exactly once" processing where needed.
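The difference between the two guarantees can be illustrated with a small, library-free Python sketch. It is not Storm's own mechanism: the retry loop and the `DedupingSink` class are hypothetical names, and deduplicating by message ID is just one common way to get an exactly-once effect on top of at-least-once delivery.

```python
def process_with_replay(messages, handler, max_attempts=3):
    """At-least-once delivery: replay each message until the handler succeeds."""
    for msg_id, payload in messages:
        for _ in range(max_attempts):
            try:
                handler(msg_id, payload)
                break                      # processed successfully, move on
            except RuntimeError:
                continue                   # failure: deliver the same message again
        else:
            raise RuntimeError(f"message {msg_id} failed {max_attempts} times")


class DedupingSink:
    """Applies each message ID at most once, so replays do not double-count."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def apply(self, msg_id, amount):
        if msg_id in self.seen:
            return                         # replayed message: ignore it
        self.seen.add(msg_id)
        self.total += amount


naive = {"total": 0}                       # side effect with no deduplication
sink = DedupingSink()                      # side effect with deduplication
failed_once = set()


def flaky_handler(msg_id, amount):
    naive["total"] += amount               # repeated whenever the message is replayed
    sink.apply(msg_id, amount)             # idempotent: safe under replay
    if msg_id == 2 and msg_id not in failed_once:
        failed_once.add(msg_id)
        raise RuntimeError("failure after the work was already done")


process_with_replay([(1, 10), (2, 20), (3, 30)], flaky_handler)
print(naive["total"], sink.total)          # prints 80 60
```

Message 2 is delivered twice, so the naive total counts it twice, while the deduplicating sink ends up with the correct sum.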
Storm is written primarily in Clojure. It is designed around "spouts" (the sources of input streams) and "bolts" (processing and output modules), which are wired together into a directed acyclic graph (DAG) called a topology. Topologies run on a cluster, where the Storm scheduler distributes work to the individual worker nodes based on the topology configuration.
You can think of a topology as roughly analogous to a MapReduce job in Hadoop, except that Storm focuses on real-time, stream-based processing, so a topology by default runs forever or until it is manually terminated. Once a topology is started, the spouts (Storm's stream sources) bring data into the system and hand it off to bolts, which may in turn pass data on to further bolts; the bolts are where the main computational work is done. As processing progresses, one or more bolts may write data out to a database or file system, send a message to another external system, or otherwise make the results available to users.
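As a rough illustration of this data flow, here is a toy model in plain Python: one spout (stream source) feeding two bolts in a linear DAG. It does not use Storm's API; the function names are hypothetical, and the spout here is finite so that the example terminates, whereas a real topology keeps running.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (a short fixed list in this sketch)."""
    for sentence in ["storm processes streams",
                     "spark processes batches and streams"]:
        yield (sentence,)


def split_bolt(tuples):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    """Bolt: keeps a running count per word and emits (word, count) updates."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


def run_topology():
    """Wire spout -> split -> count and keep the latest count seen per word."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        latest[word] = count
    return latest


result = run_topology()
print(result)
```

Each stage only sees tuples from the stage before it, which is the property that lets a real scheduler place spouts and bolts on different worker nodes.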
One big advantage of the Storm ecosystem is its rich array of available spouts, specialized for receiving data from many types of sources. While you may have to write a custom spout for a highly specialized application, there is a good chance you can find an existing one for a wide variety of sources, from the Twitter streaming API to Apache Kafka to JMS brokers.
Adapters also make it straightforward to integrate with HDFS, which means Storm can interoperate with Hadoop when necessary. Another strength of Storm is its support for multi-language programming. Although Storm itself is written in Clojure and runs on the JVM, spouts and bolts can be written in almost any language, including non-JVM languages, by means of a protocol that passes JSON between components over standard input and output.
In short, Storm is a highly scalable, fast, fault-tolerant open-source system for distributed computation, with a special focus on stream processing. Storm excels at event processing and incremental computation, and can process data streams in real time against changing parameters. Although Storm also provides primitives for general-purpose distributed RPC and can in theory be used as part of almost any distributed computing job, its fundamental strength remains event stream processing.
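The JSON-over-standard-I/O idea can be sketched in Python as follows. The framing (one JSON message followed by a line containing only `end`) and the field names (`id`, `tuple`, `command`, `anchors`) reflect my understanding of Storm's multi-language protocol and should be treated as assumptions to check against the Storm documentation. The sketch works on strings instead of real pipes and omits the initial handshake.

```python
import json

END = "end"  # assumed delimiter line that terminates every message


def frame(message):
    """Serialize one message as a JSON line followed by the delimiter line."""
    return json.dumps(message) + "\n" + END + "\n"


def read_messages(text):
    """Split a text stream into JSON messages using the delimiter line."""
    buffer = []
    for line in text.splitlines():
        if line.strip() == END:
            if buffer:
                yield json.loads("\n".join(buffer))
                buffer = []
        else:
            buffer.append(line)


def split_sentence_bolt(incoming):
    """Handle one input tuple: emit one tuple per word, then ack the input."""
    replies = [
        {"command": "emit", "anchors": [incoming["id"]], "tuple": [word]}
        for word in incoming["tuple"][0].split()
    ]
    replies.append({"command": "ack", "id": incoming["id"]})
    return replies


# What the parent process would write to the bolt's standard input.
incoming_text = frame({"id": "42", "comp": "1", "stream": "default",
                       "task": 9, "tuple": ["hello storm"]})

replies = [reply
           for message in read_messages(incoming_text)
           for reply in split_sentence_bolt(message)]

# What the bolt would write back on its standard output.
outgoing_text = "".join(frame(reply) for reply in replies)
print(outgoing_text)
```

Because the exchange is just delimited JSON text, the same component could be written in any language that can read and write standard streams.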
Spark: Distributed processing for everything
Spark, another project suited to real-time distributed computation, was originally built at the AMPLab at UC Berkeley before entering the Apache Incubator and ultimately becoming a top-level project in February 2014. Like Storm, Spark supports stream-oriented processing, but it is more of a general-purpose distributed computing platform.
As such, Spark can be seen as a potential replacement for the MapReduce functions of Hadoop, while still being able to run on top of an existing Hadoop cluster, relying on YARN for resource scheduling. In addition to Hadoop YARN, Spark can layer on top of Mesos for scheduling or run as a standalone cluster using its built-in scheduler. Note that if Spark is not used with Hadoop, some kind of network or distributed file system (NFS, AFS, and so on) is still required when running on a cluster, so that every node can access the underlying data.
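In practice, the choice of cluster manager shows up as the master URL passed to `spark-submit`. The following Python sketch only builds the command line and does not run it; the helper names, the host `master.example.com`, and the application file `job.py` are hypothetical, and the ports shown are the usual defaults.

```python
def master_url(manager, host=None, port=None):
    """Return the Spark master URL for a given cluster manager."""
    if manager == "yarn":
        return "yarn"                            # cluster is located via the Hadoop configuration
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"  # 5050 is the usual Mesos master port
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"  # 7077 is the usual standalone master port
    if manager == "local":
        return "local[*]"                        # single machine, one worker thread per core
    raise ValueError(f"unknown cluster manager: {manager}")


def submit_command(manager, app, host=None, port=None):
    """Build (but do not run) a spark-submit command line."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


command = submit_command("standalone", "job.py", host="master.example.com")
print(" ".join(command))
```

The application code itself stays the same across these options; only the master URL changes.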
Spark is written in Scala and, like Storm, supports multi-language programming, although it provides dedicated APIs only for Scala, Java, and Python. Spark has no specific abstraction equivalent to Storm's "spout", but it includes adapters for working with data stored in numerous different sources, including HDFS files, Cassandra, HBase, and S3.
Where Spark shines is in its support for multiple processing paradigms and its supporting libraries. Yes, Spark supports a streaming model, but that support comes from just one of several Spark modules; the others provide SQL access, graph operations, and machine learning alongside stream processing.
Spark also offers an extremely handy interactive shell that lets users quickly prototype and perform exploratory data analysis in real time using the Scala or Python APIs. Working in the interactive shell, you quickly notice another major difference between Spark and Storm: Spark has a distinctly "functional" flavor, where most API usage consists of chaining successive method calls that invoke primitive operations. Storm follows a different model, which tends to be driven by creating classes and implementing interfaces. Neither approach is inherently better or worse, but the difference in style alone may be enough to help you decide which system better suits your needs.
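To make the contrast in style concrete, here is the same small task (counting words in error lines) written both ways in plain Python. Neither half uses the real Spark or Storm API: `Dataset` is a hypothetical in-memory stand-in for a chainable distributed collection, and `Bolt` is a hypothetical interface in the spirit of Storm's components.

```python
from abc import ABC, abstractmethod
from collections import Counter

LINES = ["error disk full", "info started", "error disk slow"]


# Style 1: chained method calls over primitive operations.
class Dataset:
    """Hypothetical in-memory stand-in for a distributed collection."""

    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        return Dataset(x for x in self.items if predicate(x))

    def flat_map(self, fn):
        return Dataset(y for x in self.items for y in fn(x))

    def count_by_value(self):
        return dict(Counter(self.items))


chained = (Dataset(LINES)
           .filter(lambda line: line.startswith("error"))
           .flat_map(str.split)
           .count_by_value())


# Style 2: create a class and implement an interface.
class Bolt(ABC):
    """Hypothetical interface: a component that handles one value at a time."""

    @abstractmethod
    def execute(self, value):
        ...


class ErrorWordCounter(Bolt):
    def __init__(self):
        self.counts = Counter()

    def execute(self, value):
        if value.startswith("error"):
            self.counts.update(value.split())


bolt = ErrorWordCounter()
for line in LINES:
    bolt.execute(line)

print(chained)
print(dict(bolt.counts))
```

Both versions compute the same counts; the first reads as a pipeline typed into a shell, while the second reads as a component to be deployed.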
Like Storm, Spark is designed for massive scalability, and the Spark team has documented users running production clusters with thousands of nodes. In addition, Spark won the 2014 Daytona GraySort contest, a benchmark that involves sorting 100 TB of data. The Spark team also documents Spark ETL operations handling production workloads in the petabyte range.
Spark is a fast, scalable, and flexible open-source distributed computing platform that is compatible with Hadoop and Mesos and supports several computational models, including streaming, graph-centric operations, SQL access, and distributed machine learning. Spark has a solid real-world record of scaling and, like Storm, is an excellent platform on which to build real-time analytics and business intelligence systems.
How to choose
If your requirements center on stream processing and CEP (complex event processing), and you are building a purpose-built cluster from scratch for the project, I would personally favor Storm, especially if existing Storm spouts (stream sources) already meet your integration needs. This is by no means a hard and fast rule, but those factors do point toward Storm.
On the other hand, if you intend to use an existing Hadoop or Mesos cluster, or if your processing needs include substantial requirements for graph processing, SQL access, or batch processing, then Spark deserves to be considered first.
Another factor to consider is multi-language support. If you need to use code written in R or another language that Spark does not natively support, then Storm has the edge in language support. By the same token, if you need an interactive shell for exploring data through API calls, Spark offers a capability that Storm does not.
In the end, you will want to analyze both platforms in detail before making a decision. I recommend first building a small proof of concept on each platform, then running your own benchmark workloads to see whether each one handles them as expected before making the final choice.
Of course, you do not have to choose just one. Depending on your workloads, infrastructure, and requirements, you may find that the ideal solution combines Storm and Spark, along with other tools such as Kafka, Hadoop, and Flume. That is the beauty of open source.
Whichever route you choose, the existence of these tools shows that the rules of the game in the real-time business intelligence market have changed. Powerful options once available only to an elite few are now within reach of most midsize and large organizations. Don't let them go to waste; take advantage of what they offer.
Apache Storm and Spark: How to process data in real time and how to choose