Distributed computing frameworks can be divided into two kinds according to the characteristics of the data set they handle: data-flow and streaming. Data-flow frameworks process data in blocks; representatives include MapReduce (MR) and Spark, and I call them big data. Streaming frameworks process the data that arrives within a unit of time and focus more on real-time results; they mainly include Storm, JStorm, and Samza, and I call them fast data.
In this article, I mainly talk about the streaming frameworks.
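To make the distinction above concrete, here is a minimal plain-Python sketch. It does not use any of the frameworks mentioned; the function names `batch_total` and `streaming_totals` are made up for illustration. The first function waits for a whole block of data, the second updates its result as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(block: List[int]) -> int:
    """Block style: the whole data block is available before processing starts."""
    return sum(block)


def streaming_totals(records: Iterable[int]) -> Iterator[int]:
    """Streaming style: emit an updated result as each record arrives."""
    running = 0
    for value in records:
        running += value
        yield running


if __name__ == "__main__":
    data = [3, 1, 4]
    print(batch_total(data))             # one answer once the block is complete
    print(list(streaming_totals(data)))  # one answer per incoming record
```

The block version produces a single result at the end, while the streaming version always has a current answer, which is why streaming suits real-time use.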
The first is Storm, a real-time computing system that assumes the data source is dynamic and processes data as if it were flowing water.
It is characterized by low latency, high performance, distributed operation, scalability, and fault tolerance.
Its architecture is shown in the following figure.
For Storm's specific concepts, see http://blog.csdn.net/hljlzc2007/article/details/12976211; I will not introduce them in detail here.
Storm is currently the most stable open-source streaming framework, but I personally think it has two problems.
1. Although Storm supports writing spout and bolt code in multiple languages, its core is implemented in Clojure. This is a big obstacle for open-source enthusiasts working with big data: Clojure is not a mainstream language like Java or C++, so Storm's internals are hard to understand, modify, and keep under control.
2. Storm can share Hadoop cluster resources with other open-source frameworks on YARN (Hadoop 2.0), but with poor performance; this is something Storm needs to improve.
Of course, Storm is still the king of the open-source streaming frameworks at present.
The second one I want to talk about is JStorm. It was built by Alibaba and is another implementation of Storm, written in Java.
Characteristics:
1. The client API is basically the same as Storm's, so if you migrate from Storm you do not need to modify the bolt and spout code.
2. JStorm is more stable and faster than Storm.
3. It provides a number of new features.
If you are interested, try it out. Project address: https://github.com/alibaba/jstorm
The third one is Samza.
Samza is a distributed stream processing system open-sourced by LinkedIn, and it is very similar to Storm. The difference is that it runs on top of Hadoop and uses LinkedIn's own Kafka distributed messaging system.
It is a small and beautiful project developed by LinkedIn. Here is what makes it beautiful:
1. It has only a few thousand lines of code, yet its functionality is comparable to Storm's; of course, it still has many shortcomings.
2. It is combined with Kafka, which makes processing data more convenient.
3. It runs on YARN.
One of the projects I worked on before used Kafka + Storm + ElasticSearch. In the future Storm could be replaced with Samza there, and the resources of the Hadoop cluster could be used for storage and offline analysis. With both real-time and offline analytics running on Hadoop, I have to say Samza is a great project: it can slow the growth of a project's complexity and make maintenance easier. As the saying goes, small and beautiful things are more popular.
Architecture:
Samza mainly consists of three layers:
1. Streaming layer--Kafka
2. Execution layer--YARN
3. Processing layer--Samza API
Samza's streaming and execution layers are pluggable, so developers can replace these two technologies with other frameworks.
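The idea of pluggable layers can be sketched in plain Python. This is not the Samza API: `StreamLayer`, `InMemoryStream`, `LocalExecutor`, and `run_job` are hypothetical names invented for this example, with an in-memory list standing in for Kafka and an in-process runner standing in for YARN.

```python
from typing import Callable, Iterable, List, Protocol


class StreamLayer(Protocol):
    """Hypothetical streaming-layer interface: anything that can deliver messages."""

    def read(self) -> Iterable[str]: ...


class InMemoryStream:
    """Stand-in for Kafka in this sketch: serves messages from a list."""

    def __init__(self, messages: List[str]) -> None:
        self._messages = messages

    def read(self) -> Iterable[str]:
        return iter(self._messages)


class LocalExecutor:
    """Stand-in for YARN in this sketch: runs the task in the current process."""

    def run(self, task: Callable[[], List[str]]) -> List[str]:
        return task()


def run_job(stream: StreamLayer, executor: LocalExecutor,
            handler: Callable[[str], str]) -> List[str]:
    """Processing layer: depends only on the two interfaces, not on concrete systems."""
    return executor.run(lambda: [handler(message) for message in stream.read()])


if __name__ == "__main__":
    print(run_job(InMemoryStream(["a", "b"]), LocalExecutor(), str.upper))
```

Because `run_job` only touches `read` and `run`, swapping in a different stream source or executor would not require changing the processing code.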
Samza provides a YARN ApplicationMaster and a YARN job runner out of the box; in the image below, different colors represent different hosts.
The Samza client tells YARN's ResourceManager (RM) that it wants to start a Samza job. The RM tells a YARN NodeManager (NM) to allocate space for the ApplicationMaster. Once the NM has allocated that space, YARN containers run the Samza TaskRunner.
Samza State Management
Managing state in stream processing is difficult: the data is flowing and carries no state of its own, so an application has to rely on recorded historical data. Samza provides a built-in key-value database for storing that history; it is based on LevelDB and runs outside the JVM. The benefits of doing this are:
1. Lower JVM overhead
2. Greatly improved throughput, thanks to the use of internal storage
3. Fewer concurrent operations
Samza processing flow
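As a rough illustration of keeping state next to the task, here is a minimal sketch in plain Python. It is not Samza's API: `LocalKeyValueStore` and `CountingTask` are hypothetical names, and a dict stands in for the LevelDB-based store described above.

```python
from typing import Dict, Optional


class LocalKeyValueStore:
    """Dict-backed stand-in for the embedded key-value store (assumption: get/put only)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class CountingTask:
    """Hypothetical task that remembers how often it has seen each key."""

    def __init__(self, store: LocalKeyValueStore) -> None:
        self.store = store

    def process(self, key: str) -> int:
        # Read the history for this key, update it, and write it back locally.
        count = (self.store.get(key) or 0) + 1
        self.store.put(key, count)
        return count


if __name__ == "__main__":
    task = CountingTask(LocalKeyValueStore())
    print([task.process(k) for k in ["a", "b", "a"]])
```

Each message is handled with a local read and a local write, so the task never waits on a remote database to look up its history.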
The following figure is an example from the official Samza documentation, which counts page views grouped by member ID. Messages enter on Machine 1 and Machine 2 and exit on Machine 3. We can think of the messages as scattered across different messaging systems (Kafka): Samza reads topics from the different Kafka instances, processes them, and sends the results to Machine 3. I will not break this down further here; see the official documents for details.
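The grouping step in that example can be sketched without any framework. The code below is plain Python under my own assumptions, not the official Samza job: page views are `(member_id, page_url)` tuples, and `partition_by_member`, `count_views`, and `merge_counts` are names invented here. The point is that all views for one member land in the same partition, so each partition can count independently.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

PageView = Tuple[str, str]  # (member_id, page_url), an assumed record shape


def partition_by_member(views: Iterable[PageView],
                        num_partitions: int) -> List[List[PageView]]:
    """Route each view to a partition chosen from its member ID."""
    partitions: List[List[PageView]] = [[] for _ in range(num_partitions)]
    for view in views:
        member_id = view[0]
        # Deterministic byte-sum hash: the same member always maps to the same partition.
        index = sum(member_id.encode("utf-8")) % num_partitions
        partitions[index].append(view)
    return partitions


def count_views(partition: Iterable[PageView]) -> Counter:
    """Count page views per member within one partition."""
    return Counter(member_id for member_id, _ in partition)


def merge_counts(per_partition: Iterable[Counter]) -> Dict[str, int]:
    """Combine the per-partition counts into one result."""
    total: Counter = Counter()
    for counts in per_partition:
        total.update(counts)
    return dict(total)


if __name__ == "__main__":
    views = [("m1", "/a"), ("m2", "/b"), ("m1", "/c")]
    parts = partition_by_member(views, 2)
    print(merge_counts(count_views(p) for p in parts))
```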
Project Address: https://github.com/apache/incubator-samza
Official documents: http://samza.incubator.apache.org/
All of this leaves plenty of room for imagination: will Storm stay in the lead, or can Samza replace it? Either way, as a developer, with only a few thousand lines of code to read, I can't wait to dive in.